This paper presents a dynamically scheduled parallel DSP architecture for general purpose DSP computations. The architecture consists of multiple DSP processors and one or more scheduling units. DSP applications are first captured by stream flow graphs, which are then statically mapped onto the parallel architecture. The ordering and starting time of DSP tasks are determined by the scheduling unit(s) using a dynamic scheduling algorithm.
Introduction
In recent years, significant improvements have been achieved in the computational power of programmable digital signal processors. New advances in architecture and technology have enabled DSP processors to achieve throughput of up to 16.7 MIPS and 50 MFLOPS [14]. Their high performance, programmability and low cost make them ideal for a number of real-time DSP applications, such as speech detection [2] and speech encoding. Unfortunately, the computational requirements of DSP applications have recently grown even faster. For instance, a computation rate of 1 GFLOPS is typical for High Definition Digital Television applications.
Currently, the only means to meet the high throughput demands of DSP applications is with special hardware, which can be quite expensive and time consuming to build at the prototyping stage. Given the success of DSP processors, one approach to obtaining a greater computational power while maintaining a rapid prototyping capability is to employ multiple DSP processors working in parallel. As an example, a system of 20 Motorola DSP96002 DSPs can yield a peak throughput of 334 MIPS and 1 GFLOPS.
In addition, parallel processing is well suited to DSP applications because they contain plenty of inherent parallelism. A DSP application often operates on virtually infinite streams of incoming data samples, so pipelining the signal processing operations increases the throughput of the application.
Large grain data flow graphs are natural representations of DSP applications even without the motivation of exposing the inherent parallelism. Data flow techniques have been used by DSP designers for decades in the guise of block diagram languages. A large grain data flow graph consists of a set of nodes and a set of arcs, where a node represents a fundamental DSP task (such as FIR filtering or FFT), and an arc represents the data dependency between two nodes. In the rest of the paper, we use the terms node and task interchangeably.
Synchronous Data Flow (SDF) [13] is a representative data flow technique for DSP applications. The term synchronous means two things: (1) the amount of input data consumed by each task in the application and the amount of output data produced are known at compile time and are invariable; (2) the execution time of tasks is data independent and known at compile time. This fixed execution pattern allows a multiprocessor scheduler to produce a schedule at compile time. Since a static schedule can be obtained, DSP applications described by SDF graphs are implemented on statically scheduled parallel architectures. Examples of statically scheduled parallel DSP architectures include MOMA [12], a distributed memory machine in [16], and a 16-processor MIMD machine built of commercial DSP chips and connected by FIFO queues in [17].
DSP applications captured by SDF are just a subclass of all possible DSP applications. It should not be inferred that synchronous fundamental DSP tasks are sufficient to perform every practical DSP application. The problem is that there is always a substantial amount of computation that does not fall nicely into such structures, particularly tasks involving heuristic decision making and data dependent conditionals, where steady streaming data throughput cannot be exploited [1].
To solve the problem, Stream Flow [8, 9] extends the SDF technique by including two data flow nodes that are not synchronous: the switch node and the merge node. Further, in stream flow, the execution time of nodes is not assumed to be fixed. Unlike DSP applications captured by SDF graphs, DSP applications described by stream flow graphs cannot be implemented on statically scheduled parallel DSP architectures. Because of the switch and merge nodes and the data dependent execution times, static schedules cannot be obtained at compile time. The ordering and timing of stream flow graph execution can only be determined at run-time. So, implementing stream flow graphs requires dynamically scheduled parallel DSP architectures.
In a dynamically scheduled parallel architecture, the execution sequence of nodes is determined at run-time by the presence of data at the inputs of a node and the availability of storage space for the outputs.
In this paper, we propose a dynamically scheduled parallel DSP architecture for stream flow programming. The proposed parallel DSP architecture consists of tightly coupled multiple DSP processors and one or more scheduling units. A node is enabled for execution whenever there is sufficient input data and storage space for output data. When a processor is free, an enabled node may be scheduled for execution on the processor. The scheduling task is carried out on the dedicated scheduling unit outside the DSP processors.
One obvious advantage of the dynamically scheduled parallel DSP architecture is that it can be used to implement general DSP applications which have decision making and data dependent computations. Another advantage is the capability of graceful performance degradation when a DSP processor breaks down. Such a feature is especially desirable in real-time applications requiring high performance in terms of reliability.
The main contributions of this paper are summarized as follows:
A scalable parallel DSP architecture: The parallel DSP architecture proposed in this paper is scalable to meet signal processing requirements. When more DSP processors are used, the scheduling unit may become a performance bottleneck. A distributed scheduling mechanism is proposed to address this problem.
A mapping algorithm: An algorithm is proposed to systematically map a stream flow graph onto a parallel DSP architecture.
A dynamic scheduling algorithm: We propose a dynamic scheduling algorithm that schedules a node for execution only when both input data and output storage space are available. Such a scheduling algorithm allows buffer sizes to be determined at compile time.
Simulation study: Our simulation study reveals the relationships among the grain-size, the processor utilization, and the scheduling capability. We believe these relationships have a significant impact on the design of parallel computer architectures involving dynamic scheduling.
The paper is organized as follows. Section 2 gives the notation and abbreviations used in this paper. The stream flow programming model is described in Section 3. The dynamically scheduled DSP architecture with a single DSP processor is detailed in Section 4, and parallel DSP architectures are proposed in Section 5. Implementing DSP applications requires mapping stream flow graphs onto the proposed parallel DSP architecture; the mapping problem is addressed in Section 6. Simulation results are presented and discussed in Section 7. Related work is discussed in Section 8. Finally, we summarize and point out future research directions in Section 9.
Preliminaries
The following notation and abbreviations are used throughout this paper.
NPAP: Node Program Address Pointer.
Basic Notations
G
Stream Flow Programming Model
The stream flow programming model, as a representation of DSP applications, consists of two parts: the capture of DSP applications with stream flow graphs, and the enabling rule that guides the execution of stream flow graphs. In this section, two kinds of nodes of stream flow graphs are defined, a stream flow graph analysis technique is described, and finally the enabling rule is presented.
Regular Nodes and Non-Regular Nodes
A stream flow graph consists of a set of nodes and a set of arcs, where a node represents a task and an arc represents the data dependency between two nodes. A stream flow graph differs from other data flow techniques in that it contains two kinds of nodes, regular and non-regular nodes, which we define in this section.
Regular Nodes
Suppose a node has $i$ input arcs $(a^{in}_1, a^{in}_2, \ldots, a^{in}_i)$ and $j$ output arcs $(a^{out}_1, a^{out}_2, \ldots, a^{out}_j)$. When the node fires, it consumes $m_1, m_2, \ldots, m_i$ tokens from the respective input arcs and produces $n_1, n_2, \ldots, n_j$ tokens on the respective output arcs. If $m_1, m_2, \ldots, m_i$ and $n_1, n_2, \ldots, n_j$ are predetermined at compile time, then the node is said to be a regular node. Figure 1 depicts a regular node. The number of data tokens consumed or produced by the node on each arc is marked near the arc; for example, $m_1$ indicates the number of tokens consumed from input arc $a^{in}_1$ when the node is fired. The difference between a synchronous node in SDF and a regular node in stream flow is that the execution time of a regular node is not specified and may be variable. The expressive power of regular nodes is limited. For instance, we cannot express simple conditionals using only regular nodes.
Non-Regular Nodes
Since the expressive power of regular nodes is limited, we add two non-regular nodes, namely switch and merge, with which conditionals, recursion, and loops can be expressed.
The switch node has one input arc, one control arc, and two output arcs which are designated as T and F. The data tokens presented on the control arc are boolean tokens. If the boolean token is true, the input data are copied to the T output arc, otherwise to the F arc.
The merge node has two input arcs which are designated as T and F, one control arc, and one output arc. If the boolean token presented at the control arc is true, then the data at the T arc are copied to the output arc; otherwise the data at the F arc are copied to the output arc.
Switch and merge nodes are not regular nodes because the number of data tokens consumed and produced is determined at run-time according to the value of the boolean tokens on the control arcs. Since switch and merge are the only non-regular nodes, we refer to them specifically as switch and merge, and simply use node to mean regular node.
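The behavior of the two non-regular nodes can be illustrated with a minimal Python sketch. It assumes that each firing handles exactly one data token and one boolean control token, and it models arcs as FIFO queues; the function names are hypothetical and are not part of the architecture described in this paper.

```python
from collections import deque


def fire_switch(data_in, control, out_true, out_false):
    """Route one data token to the T or F output arc, chosen by one control token."""
    flag = control.popleft()
    token = data_in.popleft()
    (out_true if flag else out_false).append(token)


def fire_merge(in_true, in_false, control, data_out):
    """Copy one token from the T or F input arc, chosen by one control token."""
    flag = control.popleft()
    source = in_true if flag else in_false
    data_out.append(source.popleft())


# Usage: two tokens are split by a switch and recombined by a merge.
data = deque([1, 2])
t_arc, f_arc, out = deque(), deque(), deque()
switch_ctl = deque([True, False])
fire_switch(data, switch_ctl, t_arc, f_arc)   # 1 goes to the T arc
fire_switch(data, switch_ctl, t_arc, f_arc)   # 2 goes to the F arc
merge_ctl = deque([False, True])
fire_merge(t_arc, f_arc, merge_ctl, out)      # takes 2 from the F arc
fire_merge(t_arc, f_arc, merge_ctl, out)      # takes 1 from the T arc
```

Which arc a firing touches is known only once the control token is read, which is why the token counts of these nodes cannot be fixed at compile time.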
Stream Flow Graph Analysis
We have seen that a node may consume and produce multiple data tokens when it fires. Arbitrary interconnection of nodes may lead to data token accumulation in some arcs. Consider the example in Figure 3. When each of $M_1$, $M_2$ and $M_3$ fires once, one token accumulates in arc $a_0$. Since the stream flow graph is supposed to be applied to infinite input data, the graph in Figure 3 would require infinite memory to hold the accumulated data.
Enabling Rule
The enabling rule gives the condition under which a node may be executed to achieve a correct computation result. In traditional data flow techniques, a node is enabled when there is sufficient input data. However, the enabling rule used by the stream flow programming model considers both input data and output storage space. A node is enabled only when there is sufficient input data and output storage space. Formally, the enabling rule is defined as follows.
Definition 3.1 (Enabling Rule) Consider a node with $i$ input arcs and $j$ output arcs.
The node becomes enabled when each input arc $a^{in}_k$, $k = 1, 2, \ldots, i$, has at least $m_k$ tokens, and each output arc $a^{out}_l$, $l = 1, 2, \ldots, j$, has at least $n_l$ empty slots.
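Definition 3.1 translates directly into a predicate over arc occupancy. The following Python sketch is illustrative only: the `Arc` class, its field names, and `is_enabled` are hypothetical, and an arc is reduced to a buffer capacity and a token count.

```python
from dataclasses import dataclass


@dataclass
class Arc:
    capacity: int    # buffer slots allocated to the arc
    tokens: int = 0  # tokens currently held in the buffer

    def free_slots(self) -> int:
        return self.capacity - self.tokens


def is_enabled(inputs, outputs) -> bool:
    """inputs: list of (arc, m_k) pairs; outputs: list of (arc, n_l) pairs."""
    has_data = all(arc.tokens >= m for arc, m in inputs)
    has_space = all(arc.free_slots() >= n for arc, n in outputs)
    return has_data and has_space


# Usage: a node consuming 2 tokens and producing 1 token per firing.
a_in = Arc(capacity=4, tokens=2)
a_out = Arc(capacity=3, tokens=2)
ready = is_enabled([(a_in, 2)], [(a_out, 1)])    # enough data and space
blocked = is_enabled([(a_in, 2)], [(a_out, 2)])  # only 1 free output slot
```

The second condition is what distinguishes this rule from the traditional data flow firing rule, which inspects input arcs only.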
We will see in the next section that the enabling rule permits us to construct a dynamic scheduling algorithm and to determine the memory requirements when implementing DSP applications captured by stream flow graphs on our parallel DSP architecture.
The single processor architecture consists of a DSP processor, designated the signal processing unit (SPU), and a dynamic scheduler, designated the scheduling unit (SU), as shown in Figure 5. SPU executes enabled nodes, and SU determines the order and the starting time of node execution. Our architecture is referred to as a dynamically scheduled architecture because the ordering and the timing are determined at run time by SU according to the availability of enabled nodes and of the SPU. SPU has a program memory (PM), a data memory (DM) and a main data path (MDP). SU consists of a scheduling processor (SP), a signal graph memory (SGM) and node status registers (NSR).
The whole procedure of program execution is as follows. First, a DSP application captured by a stream flow graph is compiled into a tuple $(P, S)$. The P-portion is a set of node programs, which are sequences of instructions performing fundamental DSP operations, while the S-portion holds the structure of the stream flow graph and is kept in SU for scheduling purposes. SPU starts to execute a node program when the first instruction address of the node program is presented at the fire link from SU to SPU. During the execution of the node program, SPU fetches input data from DM, operates on the data following the instructions, and puts output data back to DM, to be consumed by successor nodes. In our architecture, buffer memories are allocated to arcs to allow data to be read from and written to DM. After the execution terminates, SPU sends the first instruction address of the node program through the done link to SU to flag the completion of processing, and is ready to process another enabled node. Finally, upon receiving the first instruction address of the terminated node program, SU applies the enabling rule, finds newly enabled nodes, and sends the address of the first instruction of one of the enabled node programs to SPU through the fire link.
As described above, the first instruction addresses are used to execute (fire) node programs and to flag the termination (done) of node programs. A first instruction address is called a fire token when it is presented on the fire link, and a done token when it is presented on the done link.
To facilitate the reception of fire tokens, a special register, named the node program address pointer (NPAP), is used. A fire token is first loaded into the NPAP. NPAP is then used to fetch the first instruction of the node program. To initiate the execution of a node program, two instructions are required.
MOVE NPAP, Rn; JMP Rn;
The MOVE instruction copies the contents of NPAP to an address register, Rn, and the JMP instruction branches to the first instruction of the node program.
SPU and SU interact with each other through the flow of fire and done tokens. Accepting a new fire token and sending out a done token are both triggered by the termination of the current node program. The implementation of the SU and SPU interface is shown in Figure 6.
The following steps describe the procedure of starting a node program in SPU:
1. Apply the contents of NPAP to the data bus.
2. Move the contents of the transmit register to done token queue.
3. Copy the contents of NPAP to the transmit register.
4. Move a new fire token into NPAP from the top of the fire token queue.
Scheduling Unit
As indicated in the previous section, the ordering and the starting time of node program execution are determined by SU at run-time. SU starts dynamic scheduling by checking whether any node has become enabled as a consequence of the completed node. In order to determine which node is enabled, SU needs to know the structure of the stream flow graph being processed. This section describes how stream flow graphs are compiled in SU, followed by a description of the dynamic scheduling algorithm and some precautions in using it.
An arc has four parameters: (1) the number of data tokens produced by the node that puts data on the arc (producer), $N_p$; (2) the number of data tokens consumed by the node that removes data from the arc (consumer), $N_c$; (3) the buffer size allocated to the arc, $N_s$; and (4) the number of data tokens left in the buffer, $N_b$.
An arc has three states, which are used to determine which node can be enabled. If all arcs connected to a node are in a state that allows the node to fire, the node is enabled to fire.
Stream Flow Graph Compilation in SU
A stream flow graph is compiled in SU in the following way: (1) the structure of the stream flow graph is represented by two arrays, the node array and the arc array; (2) the number of data tokens consumed and produced by node programs is compiled into the Scheduling Data Table (SDT); (3) whether a node can be enabled is finally determined by the contents of the node status registers (NSR).
Figure 7: Spectrum analysis and its structure representation
The precedence relationships are expressed by the node array NODE(1 : $N_v$) and the arc array ARC(1 : $2N_e$). The $i$-th element in the node array, NODE($i$) ($1 \le i \le N_v$), gives the starting point of the arc list connected to the $i$-th node. The arcs connected to the $i$-th node are stored in ARC($j$), where NODE($i$) $\le j <$ NODE($i+1$). ARC($j$) has two items: a pointer to the Scheduling Data Table (SDT) and a producer/consumer (p/c) bit. If the p/c bit is '1', the $i$-th node is the producer, otherwise the consumer. We use spectrum analysis, shown in Figure 7, to explain how a stream flow graph is compiled in SU.
Each arc has an entry in SDT, containing the locations of the producer and consumer status registers, $N_s$, $N_p$, $N_c$ and $N_b$. The SDT of the sample graph is shown in Table 1. Each node has an NSR to indicate whether it can be enabled. The NSR has three fields: the enable count, the reset count, and the first instruction address of the node program in PM. When the enable count of a node is decremented to 0, the node is enabled, the first instruction address of the node program is sent to SPU as a fire token, and the enable count is set to the value of the reset count. The reset count is a constant equal to the degree of the node, which means that the node has to receive permission to fire from each of its connected arcs. The NSRs for the sample stream flow graph are shown in Table 2. The enable count values in Table 2 are obtained by assuming that no arc holds any data token at the beginning of stream flow graph execution.
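The node array and arc array form a compact adjacency structure in which each arc appears twice, once for its producer and once for its consumer. The Python sketch below builds such arrays for a small graph. It is illustrative only and makes two assumptions that differ from the notation above: indices are 0-based, and a sentinel entry is appended to the node array so that the arc list of the last node is also delimited. The function names are hypothetical.

```python
def build_arrays(num_nodes, arcs):
    """arcs: list of (producer, consumer) node indices, 0-based.

    Returns (node, arc). node[i] is the start of node i's arc list in arc,
    and node[num_nodes] is a sentinel equal to len(arc). Each arc entry is
    (sdt_index, pc_bit), where pc_bit is 1 for a producer and 0 for a consumer.
    """
    per_node = [[] for _ in range(num_nodes)]
    for sdt_index, (producer, consumer) in enumerate(arcs):
        per_node[producer].append((sdt_index, 1))
        per_node[consumer].append((sdt_index, 0))
    node, arc = [], []
    for entries in per_node:
        node.append(len(arc))
        arc.extend(entries)
    node.append(len(arc))
    return node, arc


def arcs_of(node, arc, i):
    """All arc entries connected to node i."""
    return arc[node[i]:node[i + 1]]


# Usage: a three-node chain 0 -> 1 -> 2 with two arcs.
node, arc = build_arrays(3, [(0, 1), (1, 2)])
middle = arcs_of(node, arc, 1)  # consumer of arc 0, producer of arc 1
```

With $N_e$ arcs the arc array holds $2N_e$ entries, matching the array bound given above.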
Dynamic Scheduling Algorithm
The dynamic scheduling algorithm used by SU is described as follows. If the enable count reaches zero, the consumer is enabled, the fire token of the consumer is sent to the fire token queue, and the enable count is set to the value of its reset count.
If $N_s - N_b > N_p$, the producer enable count is fetched and decremented by one.
If the enable count reaches zero, the producer is enabled, the fire token of the producer is sent to the fire token queue, and the enable count is set to the value of its reset count.
6. NODE($i$) $\leftarrow$ NODE($i$) + 1. If NODE($i$) $<$ NODE($i+1$), then go to step 2, otherwise continue.
7. If the done token queue is not empty, get another done token and go to step 1; otherwise wait for a done token.
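The interplay of fire tokens, done tokens and arc updates can be illustrated with a small simulation. The Python sketch below is not the counter-based algorithm above: instead of enable counts it re-evaluates the enabling rule of Definition 3.1 (in its "at least" form) after every done token, which is simpler to follow but would be slower in hardware. It assumes a single SPU and that every arc has exactly one producer and one consumer; all names are hypothetical.

```python
from collections import deque


class Arc:
    def __init__(self, producer, consumer, n_p, n_c, n_s):
        self.producer = producer  # node writing to the arc
        self.consumer = consumer  # node reading from the arc
        self.n_p, self.n_c, self.n_s = n_p, n_c, n_s
        self.n_b = 0              # tokens currently in the buffer


def is_enabled(node, arcs):
    for arc in arcs:
        if arc.consumer == node and arc.n_b < arc.n_c:
            return False          # not enough input data
        if arc.producer == node and arc.n_s - arc.n_b < arc.n_p:
            return False          # not enough output space
    return True


def simulate(nodes, arcs, max_firings):
    """Return the order in which nodes fire on a single SPU."""
    fire_queue, queued, order = deque(), set(), []

    def schedule():
        # Queue every enabled node that is not already waiting to fire.
        for node in nodes:
            if node not in queued and is_enabled(node, arcs):
                fire_queue.append(node)
                queued.add(node)

    schedule()
    while fire_queue and len(order) < max_firings:
        node = fire_queue.popleft()   # fire token reaches the SPU
        order.append(node)
        for arc in arcs:              # done token: update N_b on the node's arcs
            if arc.producer == node:
                arc.n_b += arc.n_p
            if arc.consumer == node:
                arc.n_b -= arc.n_c
        queued.discard(node)
        schedule()
    return order


# Usage: P produces 1 token per firing, C consumes 2, buffer of 2 slots.
trace = simulate(["P", "C"], [Arc("P", "C", n_p=1, n_c=2, n_s=2)], max_firings=6)
```

In this usage example the producer fires twice before the consumer can fire once, and the two-slot buffer is never exceeded.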
Precautions
At this point, it is necessary to point out that the dynamic scheduling algorithm is in danger of violating the enabling rule. Consider the following scenario. An updated arc is in the state that allows both consumer and producer to fire, so the enable counts of both consumer and producer are decremented by one. Suppose the producer is enabled to fire and the consumer's enable count has a value of 1, so the consumer waits for one more count down signal from another arc. The producer may complete execution before the consumer's enable count is decremented to 0. The arc now certainly has enough data for the consumer to fire, so updating it decrements the consumer's enable count to 0, causing the consumer to fire before receiving permission to fire from the arc it is still waiting on. The firing of the consumer violates the enabling rule and results in an incorrect computation. A double-count bit, the D-bit, is added to each entry of SDT to prevent violations of the enabling rule. The usage of the D-bit is described as follows.
Each SDT entry is given a D-bit, which is preset to 0.
If an updated arc allows both consumer and producer to fire and the D-bit is 0, send each node a count down signal and set the D-bit to 1.
When the arc is updated again:
- D-bit is set, and p/c bit is 1: if both producer and consumer are allowed to fire, a count down signal is sent to the producer only and the D-bit is kept set; if only the consumer is allowed to fire, the D-bit is set to 0.
- D-bit is set, and p/c bit is 0: if both consumer and producer are allowed to fire, a count down signal is sent to the consumer only and the D-bit is kept set; if only the producer is allowed to fire, the D-bit is set to 0.
With the D-bit, a node never receives more than one count down signal in a row from any one of its arcs. Thus, the correct execution sequence is enforced.
Determination of the Size of Buffers
Our focus has been on the architectural design, yet the size of the buffers allocated to arcs is an important factor in determining the states of arcs, and it influences the ordering and timing of node program execution. We now discuss how to determine buffer sizes.
In the stream flow model, a node is enabled if two conditions are met: (1) there are enough data tokens at the input arcs, and (2) there is enough storage space at the output arcs. This enabling rule permits us to determine the buffer size allocated to arcs at compile time, even though a dynamic scheduling algorithm is used. First, we define deadlock in stream flow graph execution.
Definition 4.1 A deadlock happens when the buffer allocated to an arc has neither sufficient data tokens to enable the consumer to fire nor enough empty space for the producer to fire.
In implementing DSP applications, the allocated buffer size must be large enough to avoid deadlocks. For a stream flow graph having solutions to the state equations, we can easily determine a buffer size that will not cause deadlocks. Suppose the producer of an arc produces $N_p$ data tokens when firing and the consumer consumes $N_c$ when firing. Then we have $S_p N_p = S_c N_c$, where $S_p$ and $S_c$ are the solutions for the producer and the consumer respectively. In the worst case, the producer is ordered to execute $S_p$ times in succession, followed by $S_c$ executions of the consumer. So, a buffer size of $S_p N_p$ is enough to sustain stream flow graph execution without causing deadlocks.
The minimum buffer size can be found by lazy evaluation of the stream flow graph, where a node is not enabled if, at its output arcs, there is already sufficient data for consumption by its immediate successors. By examining the profile of the number of data tokens in an arc during lazy evaluation, we may take the minimum buffer size to be the maximum value of the arc's profile.
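As a worked illustration of the bound $S_p N_p$, the Python sketch below computes the smallest firing counts that balance a single arc and the resulting buffer size. It treats the arc in isolation, which is an assumption: in a complete graph, $S_p$ and $S_c$ come from the state equations of the whole graph and may be larger multiples of the values computed here. The function names are hypothetical.

```python
from math import gcd


def basic_firings(n_p, n_c):
    """Smallest positive (S_p, S_c) such that S_p * n_p == S_c * n_c."""
    common = n_p * n_c // gcd(n_p, n_c)  # least common multiple of n_p and n_c
    return common // n_p, common // n_c


def safe_buffer_size(n_p, n_c):
    """Buffer size S_p * N_p that holds S_p back-to-back producer firings."""
    s_p, _ = basic_firings(n_p, n_c)
    return s_p * n_p


# Usage: a producer emitting 2 tokens per firing feeding a consumer that takes 3.
s_p, s_c = basic_firings(2, 3)   # producer fires 3 times, consumer fires 2 times
size = safe_buffer_size(2, 3)    # 6 slots
```

Here the producer fires three times and the consumer twice per iteration, so six slots absorb the worst case in which all producer firings precede the consumer firings.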
Parallel DSP Architectures
So far, we have addressed the design of a single processor architecture for stream flow programming. As stated in the introduction, multiple DSP processors are used to implement high throughput DSP applications. In this section, we address the architectural design of a multiple-processor system. Our parallel DSP architecture is composed of multiple SPUs and one SU. There is no restriction on the interconnection topology of the parallel architecture. However, we are especially interested in linearly connected parallel DSP architectures, which increase the throughput of DSP applications by pipelining. As described earlier, pipelining is well supported by the fact that DSP applications are applied to infinite streams of incoming data. When implementing a stream flow graph on our parallel DSP architecture, more SPUs are added if more processing power is required to meet real-time constraints, such as the signal rate. Since the architecture can be expanded by adding SPUs, SU will likely become a performance bottleneck in large configurations. Therefore it is desirable to increase the processing capability of SU.
Implementation of Distributed Scheduling Mechanism
In this section, we propose a distributed scheduling mechanism, which is used to increase the scheduling capability and avoid performance bottlenecks. Here, SPUs are partitioned into clusters, with one SU handling the scheduling of one cluster. A stream flow graph is partitioned into sub-graphs, which are assigned to clusters of SPUs. Since the SU belonging to a cluster is only responsible for scheduling the sub-graph assigned to that cluster, the scheduling load is distributed among a number of SUs. However, the SU of a cluster has to know whether the input data to the assigned sub-graph is available from the cluster in the previous pipelining stage, and whether output data produced by the sub-graph have been consumed by the cluster in the next pipelining stage. So, SUs have to exchange scheduling data (done tokens) across partitions with neighboring clusters.
Exchanging scheduling data is facilitated by adding two full-duplex signal lines to an SU. One of the lines is connected to the SU of the cluster in the previous stage, while the other is connected to that of the next stage. The local SU accepts done tokens from three sources: the local cluster, the cluster in the previous stage, and the cluster in the next stage. The block diagram of the circuitry performing the done token exchange is shown in Figure 9.
An SPU simultaneously sends done tokens to the SU of the cluster in the previous stage and the SU of the cluster in the next stage. Note that only the done tokens of the nodes at the front and end partitions are exchanged. Therefore, two circuit modules check whether done tokens from the local cluster belong to the front or the end partition. The done tokens of the nodes in the front partition are stored in an identity memory of the front check module; those of the rear end partition are stored in the rear check module. If an incoming done token matches one stored in an identity memory, a copy of it is sent to the cluster of either the previous stage or the next.
Mapping Stream Flow Graphs
Implementing a stream flow graph on a parallel DSP architecture requires that the graph be partitioned and that each part be assigned to an individual processor. Finding a partition and assigning processors is called mapping. We first formulate the mapping problem, then present a systematic mapping algorithm for implementing stream flow graphs on our parallel DSP architecture.
Problem Formulation
A node $M_i$ in a stream flow graph has an associated execution time $w_i > 0$. Suppose $M_i$ has a directed arc to $M_{i+1}$. At the end of the execution period of $M_i$, data tokens are available for consumption by $M_{i+1}$. $M_{i+1}$ may access these data without delay if both nodes reside in one SPU. If, instead, these data are produced in a different SPU, there must be an explicit communication. In this case, the arc between $M_i$ and $M_{i+1}$ is exposed. We assume that when an SPU is engaged in communication with another, it cannot execute node programs. We denote this communication cost by $c_i$.
We formulate the mapping problem by looking at the signal rate constraint. Suppose $M_1$ is the initial node connected to the outside world. The number of data tokens consumed by $M_1$ is given by $S_1 N_{c1}$, where $N_{c1}$ is the number of data tokens consumed from its input. The arrival interval of data tokens at the input of $M_1$ is $T_{in}$ time units. The load constraint to meet the signal rate is given by
$$w = S_1 N_{c1} T_{in}$$
If a sub-graph is assigned to processor $P_i$, the load $L_i$ is given by
$$L_i = \sum_j w_j + \sum_k c_k$$
where $w_j$ is the execution time of a node in the sub-graph, and $c_k$ is the communication cost of an arc on the two ends of the sub-graph.
Definition 6.1 The mapping problem under the signal rate constraint for a stream flow graph is to find a partition such that every processor's load is less than or equal to $w$.
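The feasibility test of Definition 6.1 can be sketched in a few lines of Python. The sketch assumes that the load of a processor is the sum of the execution times of its nodes plus the communication costs of its exposed arcs, as described above; the function names and the sample numbers are hypothetical.

```python
def load_bound(s_1, n_c1, t_in):
    """Load constraint w: time available to each processor per iteration."""
    return s_1 * n_c1 * t_in


def processor_load(exec_times, exposed_costs):
    """Execution times of the assigned nodes plus costs of the exposed arcs."""
    return sum(exec_times) + sum(exposed_costs)


def is_feasible(partition, bound):
    """partition: one (exec_times, exposed_costs) pair per processor."""
    return all(processor_load(w, c) <= bound for w, c in partition)


# Usage: M_1 fires twice per iteration, consumes 4 tokens per firing,
# and a token arrives every 5 time units.
w = load_bound(s_1=2, n_c1=4, t_in=5)
partition = [([10, 20], [5]), ([30], [5, 5])]  # loads of 35 and 40
ok = is_feasible(partition, w)
```

A partition that fails this test requires either more processors or a different cut that exposes cheaper arcs.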
Mapping Algorithm
The mapping algorithm consists of the following four steps: (1) expanded precedence graph (EPG) construction, (2) node coalescence, (3) graph conversion, and (4) graph partition.
Expanded Precedence Graph Construction
Since the number of firings for a node might be greater than one, the stream flow graph has to be expanded to reflect the multiple firings before it is mapped. The basic solution is the smallest solution to the state equations. It also defines the iteration of a DSP application: when all nodes have fired the number of times specified by the basic solution, we say that the DSP application has just finished an iteration. A directed graph representing the computations of an iteration is called an expanded precedence graph (EPG).
A stream flow graph is expanded according to the number of firings of each node program. The procedure to construct an EPG from a stream flow graph is described as follows: where $N_p$ is the number of tokens produced by $M_i$ and $c_0$ is the communication cost for passing one data token from a processor to a neighboring processor.
Node Coalescence
Node coalescence avoids assigning two heavily communicating nodes to different processors. The criteria for coalescing nodes are given by two theorems. An example is given to further explain the intuition of node coalescence.
Theorem 6.1 Suppose a stream flow graph has a partitioning on $n$ processors of weight $w$. Then a similar partition, or a partition of less weight, exists even if we merge nodes $i$ and $i+1$, provided the communication cost between node $i$ and node $i+1$ is equal to or greater than the weight of node $i$ plus the sum of the communication costs of all other arcs of node $i$.
A proof of this theorem can be found in Appendix A. The following theorem can be proved in the same way.
Theorem 6.2 Suppose a stream flow graph has a partitioning on $n$ processors of weight $w$. Then a similar partition, or a partition of less weight, exists even if we merge nodes $i$ and $i+1$, provided the communication cost between node $i$ and node $i+1$ is equal to or greater than the weight of node $i+1$ plus the sum of the communication costs of all other arcs of node $i+1$.
The above two theorems mean that, when their conditions are met, coalescing two nodes does not increase the weight of a partition and may reduce it. Node coalescence of a stream flow graph is accomplished by the following procedure:
An example of node coalescence is shown in Figure 10. If the left node and the right node are assigned to two different processors, the load for the left side processor with the left node is $19 + W_l$, where $W_l$ stands for the rest of the load of the processor, while the load for the right side processor with the right node is $13 + W_r$, where $W_r$ stands for the rest of the load of the processor. After coalescing the two nodes and assigning both to the left side processor, the load for the left side processor is $14 + 3 + W_l < 19 + W_l$ and the load for the right side processor is $3 + W_r < 13 + W_r$.
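The conditions of Theorems 6.1 and 6.2 are simple inequalities and can be checked mechanically for each pair of adjacent nodes. The Python sketch below is illustrative only: the function and parameter names are hypothetical, and the sample numbers are not those of Figure 10.

```python
def can_coalesce(shared_cost, weight_a, other_costs_a, weight_b, other_costs_b):
    """True if the condition of Theorem 6.1 or of Theorem 6.2 holds.

    shared_cost: communication cost of the arc joining the two nodes.
    weight_x: execution time of node x.
    other_costs_x: communication costs of all other arcs of node x.
    """
    by_first = shared_cost >= weight_a + sum(other_costs_a)   # Theorem 6.1
    by_second = shared_cost >= weight_b + sum(other_costs_b)  # Theorem 6.2
    return by_first or by_second


# Usage: the shared arc costs 10; the first node has weight 4 and one other arc of cost 3.
merge_ok = can_coalesce(10, 4, [3], 20, [1])     # 10 >= 4 + 3, so coalescing is safe
merge_blocked = can_coalesce(5, 4, [3], 20, [1])  # 5 < 7 and 5 < 21
```

Intuitively, when the shared arc costs more than everything else a node contributes, separating the two nodes can never lighten a processor.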
Graph Conversion
Generally the stream flow graph is not a chain structured graph, so here we study how to convert an arbitrary stream flow graph into a chain structured one. We first study two simple cases, and then derive a conversion rule from the results of the two cases. For example, arc (A,B) is a transitive arc. Since the precedence relationship between A and B is automatically satisfied by (A,C) and (C,B), removing (A,B) will not change the precedence relationship of nodes A and B. When removing arc (A,B), we add the communication cost of (A,B) to arcs (A,C) and (C,B). This graph is converted to a chain structured graph by eliminating a parallel arc and moving the communication cost of that arc onto the arcs on a path from the origin node to the destination node of the eliminated arc, as shown in Figure 12a.
As shown in Figure 12b, $M_x$ is chosen as the first node to convert the graph into a linear array. Here $M_y$ could also be chosen as the first node after $M_i$. Which node comes first depends on which of the values $w_x + c_2 + c_3$ and $w_y + c_1 + c_4$ is smaller. The intuition for this arrangement is that when we partition a stream flow graph, we always take the node that incurs the smallest load for inclusion in a partition. This guarantees that if a partition of weight $w$ is able to take extra load, it always has the node of smallest weight available to choose. Otherwise, a node with a larger load may prevent other nodes with smaller loads from being included in a partition without exceeding the partition load constraint $w$.
General Stream Flow Graph Conversion
The conversion rule for general stream flow graphs is based on the previous two cases. A stream flow graph can be partitioned into segments containing concurrent nodes. Given a subgraph with $g + 1$ nodes between $M_i$ and $M_j$, it is converted to a chain-structured graph in the following steps:
Repeat Step 2 and Step 3.
6. Repeat Steps 4 and 5 until the last node, $M_{x+g}$, between $M_i$ and $M_j$ is reached.
As an example, we show a sub-graph with 5 nodes between $M_i$ and $M_j$. Figure 13b shows the graph when $M_x$ is chosen and the rest of the arcs connecting to $M_i$ are eliminated. Figure 13c shows the graph after removing transitive arcs. Figure 13d shows the resulting sub-graph after one more conversion iteration, and Figure 13e shows the final result after removing the transitive arc from $M_{x+3}$ to $M_j$.
The following procedure partitions the chain structured graph and gives the smallest number of processors sustaining the partitioning. 
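To illustrate the kind of computation involved, the sketch below finds the smallest number of contiguous partitions of a chain by dynamic programming. It is one possible formulation under stated assumptions, not necessarily the procedure used in this paper: the load of a partition is taken as the sum of its node weights plus the costs of the arcs that cross its boundary, every partition must stay within a load bound `w_max`, and the names `min_processors`, `weights`, `costs` and `w_max` are hypothetical.

```python
def min_processors(weights, costs, w_max):
    """Smallest number of contiguous partitions of a chain such that every
    partition load (node weights plus boundary arc costs) is at most w_max.

    weights[i] is the weight of node i; costs[i] is the cost of the arc
    between node i and node i + 1. Returns None if no partitioning is feasible.
    """
    n = len(weights)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[k]: fewest partitions covering the first k nodes
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):  # candidate partition holds nodes start..end-1
            load = sum(weights[start:end])
            if start > 0:
                load += costs[start - 1]  # arc entering the partition
            if end < n:
                load += costs[end - 1]    # arc leaving the partition
            if load <= w_max and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
    return None if best[n] == inf else best[n]

# Hypothetical chain of four nodes with three connecting arcs.
print(min_processors([4, 3, 5, 2], [1, 2, 1], w_max=10))
```

For this assumed chain the nodes split into two partitions of load 9 each, since a single partition would carry a load of 14.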
Simulation study
This section presents simulation results on the effectiveness of our parallel DSP architecture for stream flow programming. We are interested in the relationship among the grain size, the scheduling capability, and the utilization of processors.
A Petri net is constructed for the simulation. As shown in Figure 14, the Petri net is composed of two transitions and two token queues. Transition $t_0$, which represents the SPU, consumes a fire token from token queue $q_0$. The time spent in $t_0$ is determined by the weight attribute of the token being processed. After a fire token is consumed and processed, a done token is produced and put into the done token queue, $q_1$. Transition $t_1$, which represents the SU, consumes a done token from $q_1$. The delay of $t_1$ represents the scheduling capability of the SU. When a done token is consumed by $t_1$, a C program is invoked to emulate the scheduling algorithm.
To collect simulation data about the utilization of the SPU, an idle token is introduced, and the queue $q_2$ is created to hold it. The idle token carries the total idle time of the SPU as an attribute. A fire token and the idle token are consumed together by $t_0$. When $t_0$ is activated, the idle time between this activation and the completion of the last fire token is computed and added to the total idle time.

Spectrum analysis, a typical DSP application, is used in our simulation study. The weights of the primitive DSP operations are shown in Table 3. The utilization of the SPU for the spectrum analysis stream flow graph with different grain sizes is shown in Figure 15. The horizontal axis is the schedule time for processing a done token. One observation is that the utilization of the SPU decreases as the schedule time becomes larger. This is because the SPU is starved when the SU needs a longer time to process a done token. Therefore, a faster SU gives better utilization of the SPU.
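The token loop between $t_0$ (SPU) and $t_1$ (SU) can be approximated with a few lines of event bookkeeping. This is a deliberately simplified sketch, not the Petri net or the C program used in the study: it assumes a constant grain size for every fire token, a constant schedule time per done token, that each processed done token releases exactly one new fire token, and that `initial_fire` fire tokens are available at time zero. All names are hypothetical.

```python
from collections import deque

def spu_utilization(grain, schedule_time, n_tasks=1000, initial_fire=2):
    """Fraction of time the SPU is busy in a two-stage fire/done token loop."""
    q0 = deque([0.0] * initial_fire)  # times at which fire tokens become available
    spu_free = 0.0                    # time at which the SPU finishes its current token
    su_free = 0.0                     # time at which the SU finishes its current token
    idle = 0.0                        # accumulated SPU idle time (the idle token attribute)
    for _ in range(n_tasks):
        ready = q0.popleft()
        start = max(ready, spu_free)
        idle += start - spu_free      # SPU waited for a fire token
        spu_free = start + grain      # t0 fires; a done token is produced at spu_free
        su_start = max(spu_free, su_free)
        su_free = su_start + schedule_time
        q0.append(su_free)            # assumed: one new fire token per done token
    return 1.0 - idle / spu_free

for s in (1.0, 10.0, 20.0):
    print(s, round(spu_utilization(grain=10.0, schedule_time=s), 3))
```

In this simplified model the utilization stays at 1 while the schedule time does not exceed the grain size and falls once the schedule time is larger, which agrees qualitatively with the trend described for Figure 15.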
Another observation is that, for the same scheduling capability, utilization decreases as the grain size decreases. This is because the scheduling unit becomes the bottleneck when the grain size becomes smaller. In this case, done tokens accumulate in the done token queue, $q_1$, while the fire tokens are "eaten up" quickly. The schedule time has no effect on utilization when the grain size is larger than the schedule time. However, the utilization starts to decrease when the schedule time is comparable to the grain size, and it decreases sharply when the grain size is smaller than the schedule time. This suggests that the SU becomes the performance bottleneck.

... spectrum data. The throughput is the maximum number of data samples per second that the signal processing unit can support. An obvious observation is that the throughput is high (up to the capacity of the SPU) when the schedule time is small. There is almost no difference in throughput when the schedule time is not larger than 40 µs. However, the utilization of the processor shows some differences. The utilization difference is caused by the initialization of the system, when data samples start to arrive at the first node in our simulation. The total idle time mainly comes from this initialization. The throughput, however, is obtained from the time interval between two processed data groups when the SPU is working at full scale.
The figure also shows that a larger grain size gives better throughput. There are two reasons: one is that processing data samples in batches saves setup time in the SPU, and the other is that the scheduling requirement is low. However, it would be dangerous to conclude that a large grain-size stream flow graph can be used without limitation, because increasing the grain size requires larger buffers and results in long latency. Real-time signal processing systems do not tolerate storing data tokens and producing output data at long intervals, even though a higher data sample rate is supported.

We compare our work with related work in two research areas. One is parallel architectures for general purpose DSP computation, including multi-threaded architectures for data flow programming; the other is the mapping of data flow programs.
One related work is the parallel architecture for SDF, MOMA [12]. MOMA is a statically scheduled parallel DSP architecture that employs a shared data memory and is used for implementing DSP applications captured by SDF. Our architecture not only allows DSP applications with conditionals and data-dependent computations to be implemented, but also degrades gracefully when a processor breaks down. Another difference is that the memories are distributed in our architecture.
Another architecture we compare with is the AT&T Enhanced Modular Signal Processor (EMSP). Like our architecture, it is a large grain, dynamically scheduled architecture. However, EMSP does not have compile-time mapping of DSP applications. EMSP assigns tasks to processors at run time through a separate functional unit, called the scheduler, that monitors both the state of the program graph, to determine when nodes are ready to execute, and the processors, to determine when they are ready to accept a new task assignment. One notable aspect of EMSP is that there is only a single scheduler, and there is no obvious way to add a second one. Since EMSP can be expanded by adding more memory and processors, there is some danger that the scheduler becomes the performance bottleneck. In contrast, the SU in our architecture can be scaled with a distributed scheduling mechanism to avoid a performance bottleneck.
Finally, a class of related work in parallel architecture is multi-threaded architectures for data flow programming [16], [6], [15], [7], [10]. The concept of multi-threaded architectures is to directly support multiple instruction threads at the processor level. A thread is a sequence of von Neumann instructions. A thread may be in any of the following states: waiting for operands, enabled, activated, and suspended. A thread can be enabled for execution when some of the required operands are available. It is suspended if an operand fetch fails. Another thread may then take the processing unit without incurring context-switching overhead. A suspended thread may be activated again once its operands are available. A multi-threaded architecture has the potential to keep the processing unit usefully busy by rapidly switching between threads on long-latency operations, such as fetching operands from remote memory.
Our architecture is similar to a multi-threaded architecture in that multiple node programs (threads) are enabled and ready for execution. In multi-threaded architectures, multiple register sets, program counters, and an interleaved execution mechanism have to be provided. In our system, however, there is no need to support multiple active threads inside the SPU, since operands and storage space are guaranteed to be available when a node is enabled. Once a node is activated, it will not be suspended due to unavailable data. The DSP processor is guaranteed to be usefully busy as long as there are sufficient enabled nodes available. Commercially available DSP processors with only one program counter and one register set can be used.
Earlier work on partitioning chain-structured program modules for linear processor arrays is found in [5], [11]. These methods are applicable to chain-structured programs. Unfortunately, stream flow graphs do not come nicely with nodes connected in a chain. In this paper, we proposed a systematic method to map a stream flow graph onto a linear processor array through graph transformations, including construction of the EPG, node coalescence, and graph conversion. This mapping algorithm substantially extends the earlier work to a wider class of signal flow graphs. It can be used for static assignment in fully static schedules as well.
Summary
In this paper, we presented a dynamically scheduled, scalable parallel DSP architecture. A distributed scheduling mechanism is proposed to avoid potential performance bottlenecks when the configuration of the parallel architecture is large. We developed a systematic mapping algorithm to map a stream flow graph onto a parallel DSP architecture. The mapping algorithm substantially extends earlier work on partitioning. We proposed a dynamic scheduling algorithm that schedules a node for execution only when both input data and output storage space are available. Such a scheduling algorithm allows buffer sizes to be determined at compile time. Further, our simulation study revealed the relationship among the grain size, the processor utilization, and the scheduling capability. The simulation results will help parallel computer designers determine the resource requirements in multiprocessor architecture designs involving dynamic scheduling.
Future work involves prototyping the proposed parallel DSP architecture. As a first step, a microprocessor will be used as the scheduling unit. An application-specific programmable hardware implementation of the scheduling unit can also be considered. Another research direction is to implement a stream flow programming environment for DSP and communication system design targeting the proposed architecture. We plan to implement such an environment on top of an existing environment, Ptolemy [4], [3]. Our compilation techniques for dynamically scheduled architectures are planned to be introduced into Ptolemy within the stream flow programming environment.
Proof: When node i and node i + 1 reside in one processor, the proof is trivial.
We assume that node $i$ and node $i+1$ reside in two separate processors, $P_i$ and $P_{i+1}$.

Suppose node $i$ has $j+1$ arcs. The communication costs of $j$ of these arcs are denoted as $(c_{i,1}, c_{i,2}, \ldots, c_{i,j})$, and the cost of the arc connecting node $i$ and node $i+1$ is denoted as $c_i$.

The load for $P_i$ is given by

$$w_{P_i} = w_i + c_i + w^{r}_{i},$$

where $w^{r}_{i}$ is the sum of all other node weights in the partition for $P_i$ plus the communication costs to the outside of the partition.

The load for $P_{i+1}$ is given by

$$w_{P_{i+1}} = w_{i+1} + c_i + w^{r}_{i+1},$$

where $w^{r}_{i+1}$ is the sum of all other node weights in the partition for $P_{i+1}$ plus the communication costs to the outside of the partition. Now we merge node $i$ and node $i+1$ and move node $i$ to $P_{i+1}$. We calculate the loads for the processors after the merge, assuming the worst case where all arcs of node $i$ except $c_i$ reside in $P_i$.

$$\cdots \; c_i + w_{i+1} + w^{r}_{i+1} = w_{P_{i+1}}.$$
