Abstract-In this paper, we consider multiprocessor implementation of real-time recursive digital signal processing algorithms. The objective is to devise a periodic schedule and a fully static task assignment scheme that meet the desired throughput rate while minimizing the number of processors. Toward this goal, we propose a notion called cutoff time. We prove that the minimum-processor schedule can be found within a finite time interval bounded by the cutoff time. As a result, the complexity of the scheduling algorithm and the allocation algorithm can be significantly reduced. Next, taking advantage of the cutoff time, we derive efficient heuristic algorithms that offer better performance and lower computational complexity than existing algorithms. The algorithms are tested on an extensive set of benchmark examples, with encouraging results.
I. INTRODUCTION
In real-time digital signal processing applications, the required throughput rate often exceeds that of a single processing unit, and hence requires the use of an application-specific multiprocessor system. The purpose is to exploit the parallelism inherent in the DSP algorithm so as to expedite the computation. For example, in high definition television systems, 2 million 24-bit pixels are to be processed in 1/30 s. Other applications that require ultra high throughput rates include digital libraries, virtual reality, and video compression.
The problem of deriving a minimum-processor implementation for recursive DSP algorithms with a real-time constraint is considered in this paper. We assume each processor is capable of sequentially executing the assigned tasks according to a specified schedule. We further assume that when a processor needs to access data computed in another processor, a dedicated data bus connecting these two processors will be available. Hence, in the subsequent discussion, we will assume that the delay due to interprocessor communication is negligible compared to the computing time of each task. The assumption of a dedicated data bus is reasonable for situations such as high level synthesis of multimodule DSP processing elements, where on-chip parallel data buses can be afforded. The assumption of negligible interprocessor communication delay will be inadequate when the DSP algorithm is to be realized on a general-purpose, message-passing, distributed-memory parallel processor.
The real-time constraint requires that the data sampling period, i.e., the interval between the arrivals of two consecutive data samples, be no smaller than the initiation interval, i.e., the interval between two consecutive initiations of the DSP algorithm. One popular design style for implementing real-time DSP algorithms, called periodic schedule and fully static task assignment (PSFS) [1], is adopted. With a PSFS implementation, we need only schedule and assign the operations in a single iteration of the DSP algorithm. Operations in other iterations will have the same relative schedule, measured against the starting time of each iteration, and exactly the same processor assignment.
To fulfill the goal, two questions are asked. First, how do we know the given data sampling period is large enough to have a PSFS implementation for the DSP algorithm? In a recursive DSP algorithm, one or more output variables at the present iteration depend on their values computed in previous iterations. Due to this cyclic dependence, there is a lower bound on the initiation interval for a given recursive DSP algorithm [2]. Look-ahead transformation [3] can be used to reduce the lower bound on the initiation interval. However, look-ahead transformation alone cannot guarantee the existence of a PSFS implementation [4].
Second, how do we find a minimum-processor PSFS implementation when the existence of a PSFS implementation for a given data sampling period is guaranteed? In order to find the minimum-processor PSFS implementation, all the possible PSFS implementations have to be searched. However, because there is no constraint on how long one iteration of the DSP algorithm may take to finish, the number of possible PSFS implementations is potentially infinite.
In this paper, we will present Architectural synthesis for Real-Time DSP (ART). It consists of an unfolding algorithm transformation and a novel scheduling and allocation method. Earlier, we proposed a criterion called generalized perfect rate graph (GPRG) [5]. We proved that a PSFS implementation which meets the real-time constraint exists if and only if the recursive DSP algorithm is a GPRG. Any recursive DSP algorithm can be transformed to a GPRG by unfolding it at least minimum collapsing factor times [5].
Below, we will concentrate on how ART derives a minimum-processor PSFS implementation for a given GPRG. To reduce the number of possible PSFS implementations searched, we will first propose a method to compute the cutoff time for the given GPRG and a prespecified data throughput rate. We will show that the optimal PSFS implementation can be found among all the possible PSFS implementations within the interval of cutoff time as determined. Having derived the cutoff time, we proceed to develop a novel two-phase algorithm in this paper: First we derive a minimum-overlapped periodic schedule using an artificial intelligence approach called planning [6] . Then, based on this schedule, we assign each task to a processor to derive a fully static assignment scheme while minimizing the total number of processing elements. This is achieved using a graph coloring algorithm [7] . Compared with previously reported algorithms, our method is faster, and produces better results when tested with an extensive set of benchmark examples.
We now briefly survey several relevant works. The nonoverlapped PSFS realization used by [8]-[11] takes advantage of the well-investigated conventional multiprocessor scheduling problems [12], where the tasks to be scheduled are executed only once. Many scheduling heuristics, such as critical path scheduling, are available. Algorithm transformation methods such as retiming [11], [13], software pipelining [9], loop winding [8], and loop folding [10] are used to minimize the initiation interval under the resource constraint, namely, a fixed number of processors. In general, retiming alone cannot guarantee the existence of a PSFS implementation for a given initiation interval [4]. By incorporating node decomposition, a method to derive a nonoverlapped PSFS implementation for any given initiation interval is reported in [13]. However, for fine grain flow graphs where the nodes are not decomposable, this scheme is not applicable.
In [14]-[16], a cyclo-static realization for multiprocessor implementation of DSP algorithms is adopted. A principal lattice vector is used to denote the displacement in the space-time coordinate of successive iterations. However, an efficient way to determine the optimal principal lattice vector does not yet exist. Furthermore, the cyclo-static scheduling algorithm for a given principal lattice vector, as proposed in [16], has an exponentially growing worst case computational complexity.
Renfors and Neuvo [2] derived a PSFS implementation by exploring the maximal spanning tree of the signal flow graph (SFG). No attempt was made to minimize the number of processors, however. In [4], Parhi reported that a PSFS implementation for any initiation interval can be guaranteed by unfolding the DSP algorithm to a so-called perfect-rate algorithm. They further developed an architectural synthesis tool, called MARS [17], which comprises a heuristic scheduling algorithm together with unfolding and retiming algorithm transformations. As shown in [5], the definition of a perfect-rate algorithm may require an excessive number of unfoldings. Their scheduling algorithm has nonpolynomial-time worst case complexity. Recently, a range-chart-guided scheduling algorithm has been proposed [18], which deals with both performance-constrained and resource-constrained scheduling problems. The scheduling range for each node used in the range-chart algorithm is determined heuristically, which may exclude the optimal schedule or induce extra scheduling overhead.
We note that other works [19]-[22] concerning the implementation of linear recurrence systems on a multiprocessor system have been reported. However, these works considered only finite-length data streams that are readily available at the beginning of the execution, with no real-time throughput rate constraint to meet. As such, none of these methods is applicable to the problem discussed in this paper.
The remaining sections of this paper are organized as follows. In Section II, the concept of GPRG is reviewed. In Section III, an algorithm to determine a cutoff time is presented. The scheduling and task assignment algorithm is described in Section IV. In Section V, the results of some benchmark examples are reported.
II. GENERALIZED PERFECT RATE GRAPH
A recurrent DSP algorithm can be formulated as an infinite DO loop [1]. The loop body, which corresponds to the operations that process incoming data samples, can be represented graphically by an iterative computational dependence graph (ICDG), $G = (N, E)$, where $N$ is a set of nodes corresponding to the operations of the algorithm, and $E$ is a set of arcs representing data dependencies. If an operation $i$ of the $x$th iteration depends on the results from operation $j$ of the $y$th iteration, the dependence arc is labeled with a dependence distance $\delta_{i,j} = x - y$. An example ICDG is shown in Fig. 1. In [5] we proved the following theorem: Every ICDG can be transformed into a GPRG by unfolding. In [5] we also showed that an ICDG can be converted into its corresponding GPRG by unfolding $\alpha_{gprg}$ times, where $\alpha_{gprg}$ is called the minimum collapsing factor (for GPRG).
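As an illustration of the ICDG representation, the following Python sketch stores each arc as a hypothetical triple `(producer, consumer, delta)` and computes the lower bound on the initiation interval mentioned in the Introduction. It assumes the standard loop bound, namely the maximum over all cycles of total node delay divided by total dependence distance; the function names and the brute-force cycle enumeration are illustrative choices suited only to small graphs, not part of the method in this paper.

```python
from fractions import Fraction

def simple_cycles(edges):
    """Yield each simple cycle once, as a list of (producer, consumer, delta) arcs."""
    out = {}
    for e in edges:
        out.setdefault(e[0], []).append(e)
    nodes = sorted({e[0] for e in edges} | {e[1] for e in edges})
    for start in nodes:
        # Only walk through nodes larger than `start`, so that every cycle
        # is reported exactly once, from its smallest node.
        stack = [(start, [])]
        while stack:
            node, path = stack.pop()
            visited = {start} | {e[1] for e in path}
            for e in out.get(node, []):
                if e[1] == start:
                    yield path + [e]
                elif e[1] > start and e[1] not in visited:
                    stack.append((e[1], path + [e]))

def iteration_bound(delay, edges):
    """Assumed loop bound: max over cycles of (sum of delays) / (sum of distances)."""
    best = Fraction(0)
    for cycle in simple_cycles(edges):
        total_delay = sum(delay[e[0]] for e in cycle)
        total_dist = sum(e[2] for e in cycle)
        if total_dist == 0:
            raise ValueError("cycle with zero total dependence distance")
        best = max(best, Fraction(total_delay, total_dist))
    return best

# Hypothetical two-node ICDG: A (delay 1) feeds B (delay 2) within an iteration,
# B feeds A one iteration later, and B feeds itself two iterations later.
delay = {"A": 1, "B": 2}
edges = [("A", "B", 0), ("B", "A", 1), ("B", "B", 2)]
bound = iteration_bound(delay, edges)
```

For integer time slots, the smallest admissible initiation interval would be the ceiling of this value.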
III. CUTOFF TIME FOR OPTIMAL PSFS IMPLEMENTATION
For brevity of discussion, let us call the time duration in which the operations of one iteration are scheduled an iteration span. In an overlapped periodic schedule, the iteration spans of successive iterations may overlap. Potentially, the iteration span can be very large, provided the precedence constraints are satisfied. This implies that the scheduling algorithm has to search a very long time interval for an optimal solution.
In this section, we will derive a theoretical upper bound for the iteration span and prove that an optimal solution must lie within this interval. We call this upper bound the cutoff time ($T_{cutoff}$). The derivation of the cutoff time is based on the notion of equivalent periodic schedules.

A. Equivalent Periodic Schedules

Referring to the ICDG in Fig. 2, we may devise two different overlapped schedules with the same period ($d = 6$), as depicted in Fig. 3(a), (b). Although these two schedules are different, they share the same processor request profile, which is shown in Fig. 4. In this chart, the entire circle is equally divided into $d$ sections, each representing a time slot. An arc spanning one or more sections indicates that the corresponding task is scheduled in those time slots.

Lemma 1 follows directly from Definition 2. The question now is how to find $T_{cutoff}$ or a tight upper bound of it. For this purpose, we will first discuss two special cases, namely acyclic ICDGs and strongly connected ICDGs, and then present the results for the general case.
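The processor request profile can be sketched in a few lines of Python. The sketch assumes integer start times and delays, and counts how many tasks occupy each of the $d$ time slots once every start time is wrapped modulo $d$; the function name and the example values are hypothetical.

```python
def request_profile(schedule, delays, d):
    """Number of tasks active in each of the d time slots of the period.

    schedule maps a node to its start time within the iteration,
    delays maps a node to its computing time (both integers).
    """
    profile = [0] * d
    for node, start in schedule.items():
        for t in range(start, start + delays[node]):
            profile[t % d] += 1  # wrap the slot around the period
    return profile

# Two hypothetical schedules with period d = 3 that differ only in that
# task A is shifted by one full period.
delays = {"A": 2, "B": 1}
first = request_profile({"A": 0, "B": 2}, delays, 3)
second = request_profile({"A": 3, "B": 2}, delays, 3)
same_profile = first == second
```

Shifting a task by a multiple of $d$ leaves the profile unchanged, which is why schedules with very different iteration spans can place identical demands on the processors. The largest entry of the profile is a lower bound on the number of processors required.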
The primary schedule of Fu will be an equivalent schedule that has the smallest zL3 that satisfies the inequality above. 
Based on this observation, the computation of Tcutoff(d) can be formulated by the algorithm illustrated below.
1 )
For any edge ( n i , r~i +~) in the path,
From Acyclic(), we note that ~( n i )
node $n_i \in N$. Using $\tau$ and $\delta$, $q_i(n_t)$ is computed by
+ Z(n2, nj); modified = 1; endfor until (modified == 0)

Below, we show that $T_{cutoff}$ determined by Acyclic() satisfies the condition given in Definition 2. To facilitate the proof, we first introduce two special types of nodes in an ICDG, source nodes and terminal nodes. A source node is defined as a node that has no incoming edge with a dependence distance equal to zero. We denote by $N_s$ the set of source nodes in an ICDG.
Conversely, a terminal node is a node with no outgoing edge with a dependence distance equal to zero. $N_t$ is the set of terminal nodes in an ICDG. An isolated node of an ICDG can be either a source node or a terminal node. Now we present the theorem:

In a strongly connected ICDG, each node has at least one directed path to every other node. In this section, we will show that the iteration span of any valid periodic schedule of a strongly connected ICDG with a given initiation interval is bounded. A method will be given to derive an upper bound for the corresponding iteration span.
We first define a distance graph associated with a given ICDG. $(N, E)$ is a directed graph. There is a distance, $\mathrm{dist}(n_i, n_j)$, associated with each directed edge $(n_i, n_j) \in E$.
For example, the distance graph of the ICDG in Fig. 5(a) is shown in Fig. 5(b).
To motivate our approach, let us examine the ICDG in 
Similarly, we have the following inequalities for the other two edges 
for all $n_s \in N_s$ and $n_t \in N_t$.

Proof: To prove this Theorem, we only need to show that there is a shortest path between any pair of source node and terminal node in the distance graph, that is, that the shortest path algorithm will terminate.
In a strongly connected ICDG, there is at least one path from a terminal node ($n_t$) to any source node ($n_s$). Thus a path from $n_s$ to $n_t$ exists in the distance graph. Furthermore, since $d \ge I_{min}$, the sum of the distances of the edges around any cycle in the distance graph must be nonnegative. Therefore, a finite-distance shortest path from any $n_s$ to $n_t$ in the distance graph exists, and $K$ can be defined as the maximum, over all $n_s$ and $n_t$, of the shortest distance from $n_s$ to $n_t$ plus $\tau_{n_t}$.
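The bound $K$ used in the proof can be illustrated with a small Bellman-Ford sketch in Python. The edge weights of the distance graph are an assumption here: for an ICDG arc from producer $u$ to consumer $v$ with dependence distance $\delta$, the precedence constraint $F(v) + \delta d \ge F(u) + \tau_u$ is rewritten as $F(u) - F(v) \le \delta d - \tau_u$, which gives a distance-graph arc from $v$ to $u$ of weight $\delta d - \tau_u$. This choice is consistent with the statements above (paths run from source to terminal nodes, and cycle sums are nonnegative when $d \ge I_{min}$), but the function names and the example graph are hypothetical.

```python
def shortest_from(src, nodes, arcs):
    """Bellman-Ford shortest distances from src; arcs are (tail, head, weight)."""
    dist = {n: float("inf") for n in nodes}
    dist[src] = 0
    for _ in range(len(nodes) - 1):
        for u, v, w in arcs:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in arcs:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle: d is below the iteration bound")
    return dist

def span_bound(delay, edges, d):
    """K = max over source s and terminal t of shortest(s, t) + delay[t]."""
    nodes = list(delay)
    # Assumed distance graph: ICDG arc (u -> v, delta) becomes (v -> u) with
    # weight delta * d - delay[u].
    arcs = [(v, u, delta * d - delay[u]) for u, v, delta in edges]
    has_zero_in = {v for _, v, delta in edges if delta == 0}
    has_zero_out = {u for u, _, delta in edges if delta == 0}
    sources = [n for n in nodes if n not in has_zero_in]
    terminals = [n for n in nodes if n not in has_zero_out]
    best = None
    for s in sources:
        dist = shortest_from(s, nodes, arcs)
        for t in terminals:
            candidate = dist[t] + delay[t]
            best = candidate if best is None else max(best, candidate)
    return best

# Hypothetical strongly connected ICDG: A (delay 1) -> B (delay 2) with
# distance 0, and B -> A with distance 1.
delay = {"A": 1, "B": 2}
edges = [("A", "B", 0), ("B", "A", 1)]
k_tight = span_bound(delay, edges, 3)
k_loose = span_bound(delay, edges, 4)
```

In this example the loop forces $F(B) = F(A) + 1$ when $d = 3$, so the iteration span is exactly the total delay of the loop; relaxing $d$ to 4 lets $B$ slide one slot later and the bound grows accordingly.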
From Theorem 3, we may consider a strongly connected ICDG as a super node with delay equal to K in order to estimate an upper bound of the iteration span of all the primary schedules. This observation is justified by the following Corollary.
Corollary 1: Given a strongly connected ICDG $G_s = (N, E)$ and an initiation interval $d \ge I_{min}$, the cutoff time

Thus, the Corollary is proved.
D. Cutoff Time for General ICDG's
Since the iteration span of a strongly connected ICDG with a given $d$ is bounded by $K$, a strongly connected ICDG can be treated as a super node with delay equal to $K$ when determining the cutoff time. Adopting this idea, we generalize our previous results to find $T_{cutoff}$ for any ICDG. The procedure to determine a cutoff time for any given ICDG, $G = (N, E)$, is as follows.
1) Identify all the strongly connected subgraphs, $G_{s_i}$, from $G$ and compute $K_{s_i}$ for each strongly connected subgraph.
2) Reduce $G$ to a reduced graph $G_r$ by replacing each $G_{s_i}$ by a super node, $s_i$, with delay set to $K_{s_i}$.
3) Since $G_r$ is acyclic, Theorem 2 can be applied to compute a $T_{cutoff}$ for $G_r$.
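Steps 1) and 2) of the procedure above can be sketched in Python as follows. The sketch uses Kosaraju's algorithm, a standard technique swapped in here for illustration, to find the strongly connected subgraphs, and then collapses each of them into one super node. Computing $K_{s_i}$ for each super node and applying Theorem 2 to the reduced graph are not shown; the function names and the example graph are hypothetical.

```python
def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; edges are (producer, consumer, delta) triples."""
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for u, v, _ in edges:
        fwd[u].append(v)
        rev[v].append(u)

    def dfs(start, adj, seen, out):
        # Iterative depth-first search that appends nodes in post-order.
        stack = [(start, iter(adj[start]))]
        seen.add(start)
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)

    order, seen = [], set()
    for n in nodes:
        if n not in seen:
            dfs(n, fwd, seen, order)
    components, seen = [], set()
    for n in reversed(order):
        if n not in seen:
            members = []
            dfs(n, rev, seen, members)
            components.append(frozenset(members))
    return components

def reduce_icdg(nodes, edges):
    """Collapse every strongly connected subgraph into one super node."""
    components = strongly_connected_components(nodes, edges)
    super_of = {n: i for i, comp in enumerate(components) for n in comp}
    reduced = {(super_of[u], super_of[v], delta)
               for u, v, delta in edges if super_of[u] != super_of[v]}
    return components, reduced

# Hypothetical ICDG: A -> B -> C -> D, with a feedback arc C -> B.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 0), ("B", "C", 0), ("C", "B", 1), ("C", "D", 0)]
components, reduced = reduce_icdg(nodes, edges)
```

Because every cycle lies entirely inside one component, the reduced graph is acyclic by construction, which is what step 3) relies on.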
To summarize, we have the following theorem. 

IV. SCHEDULING AND ALLOCATION ALGORITHM
Once the cutoff time $T_{cutoff}$ is computed, we are ready to determine a schedule and processor assignment for each node in the given ICDG. In this paper, we propose a heuristic algorithm that solves this scheduling and assignment problem in two consecutive phases. First, each task is scheduled to start at a time slot and extends through a duration equal to the node computing time. The objective is to minimize the maximum number of tasks scheduled on the same time slot, subject to the constraint that the desired initiation interval is maintained. Second, we assign each node to a processor such that the number of processors used is minimized, under the constraint that only one task can be executed on a processor at a time.
A. Periodic Schedule
A common feature of combinatorial optimization problems is that the solution to the overall problem can be decomposed into the solutions of a set of subproblems. There are often several candidate solutions to each subproblem. The order in which these subproblems are solved, and the candidate solution chosen for each subproblem, will dictate the quality of the solution. In the periodic scheduling problem, the schedule of each node in the ICDG is a subproblem. The time interval $[0, T_{cutoff}]$ contains the set of potential solutions to each subproblem.
The most critical principle is applied to determine the order in which these subproblems are solved. In particular, this principle calls for the prioritization of the subproblems according to a problem-specific heuristic cost function called criticalness. In this research, we use a criterion called feasible scheduling interval as a measure of the criticalness of each node in the ICDG. A node is more critical if it has fewer available feasible schedules. The criticalness of each unscheduled node is dynamically updated as more critical nodes are scheduled.
The least impact principle is designed to help choose an appropriate solution from a set of feasible solutions to a particular subproblem. Committing the current subproblem to any of its feasible solutions will unavoidably consume precious resources, thereby reducing the feasible solution space of the remaining unsolved subproblems. The least impact principle favors the solution that imposes the least impact on the feasible solution space of those unsolved subproblems. To apply this principle, in this paper, we measure the impact of scheduling a particular node at a time slot using a criterion called scheduling cost. A low cost schedule is preferred to a high cost schedule, as the former leaves more flexibility to schedule the remaining unscheduled nodes.

1) Feasible Scheduling Interval and the Most Critical Node: The ASAP schedule of node $i$ is denoted as $S_i$. A node cannot be started until all its predecessors complete, as expressed in (9). The ALAP schedule of node $i$, denoted as $L_i$, is defined as the schedule of $i$ closest to $T_{cutoff}$. The feasible scheduling interval of node $i$ is $[S_i, L_i]$.

In our algorithm, we schedule each node sequentially. According to the most critical principle, a node should be scheduled first if it is the current most critical node. The criticalness of a node $i$, denoted by $C_{cri}(i)$, is measured by the ratio between the length of the feasible scheduling interval and the delay of that node, as given in (11). A node with a larger value of $C_{cri}(i)$ has fewer alternative feasible schedules. Hence it will be given a higher priority to be scheduled early. If more than one node has the same criticalness, we arbitrarily pick one to schedule first.

Example 2: Consider the ICDG in Fig. 1 again. From Example 1, we have computed the cutoff time $T_{cutoff} = 30$. The ASAP schedule and ALAP schedule are shown in Table I. From (11), the criticalness of each node is computed and tabulated in Table I. A and C are the most critical nodes.
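The selection of the most critical node can be sketched as follows. Since a larger $C_{cri}$ is stated to mean fewer feasible schedules, the sketch assumes that the ratio takes the form of the node delay divided by the number of integer start times in $[S_i, L_i]$; this exact form is an assumption, as are the function names. The values for n2 follow Example 3 (ASAP 1, ALAP 6, and a delay of 3 inferred from the time span (6, 9)); the values for n1 and n3 are hypothetical.

```python
from fractions import Fraction

def criticalness(delay, asap, alap):
    # Assumed form of the criticalness ratio: node delay divided by the
    # number of integer start times in the feasible interval [asap, alap].
    return Fraction(delay, alap - asap + 1)

def most_critical(nodes):
    """nodes maps a name to (delay, asap, alap); ties go to the first entry."""
    return max(nodes, key=lambda n: criticalness(*nodes[n]))

# n2 follows Example 3; n1 and n3 are hypothetical placeholders.
nodes = {"n1": (2, 0, 5), "n2": (3, 1, 6), "n3": (1, 0, 7)}
first_to_schedule = most_critical(nodes)
```

After a node is scheduled, the ASAP and ALAP times of the remaining nodes would be tightened and the criticalness recomputed before the next selection.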
2) Least Impact Schedule: Once the most critical node is selected, it will be scheduled using the least impact principle. First, note that the time span of the schedule of node $n_i$ is

$$sp(n_i) = [F(n_i),\ F(n_i) + \tau_{n_i})$$

where $F(n_i)$ is the scheduled starting time of node $n_i$. A time slot $t$ is occupied by node $n_i$ if $[t, t+1) \subset sp(n_i)$. Since $S_{n_i} \le F(n_i) \le L_{n_i}$, we have $sp(n_i) \subset [S_{n_i},\ L_{n_i} + \tau_{n_i})$. We define $p_{n_i}(m)$, the demand probability of node $n_i$ to time slot $m$, as follows: [...] where $N$ is the set of all the nodes in the ICDG. A time slot with high demand thus becomes a scarce resource and, according to the least impact principle, should be avoided when scheduling a given node. To measure the impact of scheduling a node in a particular time duration, we define the scheduling cost of node $n_i$, denoted by $C_{sch}(n_i, t)$, as the sum of the total demands by other nodes to each of the time slots within $[t, t + \tau_{n_i})$. That is,

$$C_{sch}(n_i, t) = \sum_{n_j \in N,\ n_j \neq n_i} \ \sum_{k \in [t \bmod d,\ (t + \tau_{n_i}) \bmod d)} pw_{n_j}(k).$$

[Table: Node, delay, ASAP, ALAP, $C_{cri}$]

The least impact schedule is chosen to be the one which minimizes $C_{sch}(n_i, t)$ over the range $[S_{n_i}, L_{n_i}]$.
Once a node $n_i$ is scheduled at $t$, we update the demand probability by setting $pw_{n_i}(m) = 1$ for each time slot $m \in [t \bmod d,\ (t + \tau_{n_i}) \bmod d)$, to reflect the fact that these time slots have been formally occupied by node $n_i$.

Example 3: In Table II, the ASAP and ALAP schedules for a three-node ICDG with $d = 8$ are shown. According to the most critical principle, $n_2$ will be scheduled first. In order to determine the least impact schedule for $n_2$, we need to compute $C_{sch}(n_2, t)$ for all $t \in [1, 6]$. The operations in (12) and (13) to compute $pw_{n_1}(m)$ and $pw_{n_3}(m)$ for all $m \in [0, d)$ are summarized in Fig. 7, where each arc represents a possible schedule. By counting the number of overlapped arcs in Fig. 7 for each time interval, $pw$ can be determined. Since the ASAP and ALAP schedules for $n_2$ are 1 and 6, respectively, six possible schedules and the corresponding values of $C_{sch}$ are tabulated below. The time span (6, 9) has the smallest $C_{sch}$. Therefore, the least impact schedule is to start $n_2$ at $t = 6$.
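The least impact selection can be sketched in Python under one loudly stated assumption: the demand probability of an unscheduled node is taken to be uniform over its feasible start times, so that $pw(m)$ is the fraction of feasible schedules that occupy slot $m$ modulo $d$. This mirrors the arc-counting description of Fig. 7 but is not confirmed by the text; the function names and the numerical example are hypothetical, and slots of a committed node outside its span are assumed to drop to zero.

```python
from fractions import Fraction

def wrapped_demand(delay, asap, alap, d):
    """Assumed pw(m): share of the feasible starts that occupy slot m mod d."""
    starts = range(asap, alap + 1)
    pw = [Fraction(0)] * d
    for s in starts:
        for t in range(s, s + delay):
            pw[t % d] += Fraction(1, len(starts))
    return pw

def scheduling_cost(node, t, delay, demand, d):
    """Sum of the demands of all other nodes on the slots covered from t."""
    slots = [(t + k) % d for k in range(delay[node])]
    return sum(demand[other][m]
               for other in demand if other != node
               for m in slots)

def least_impact_start(node, asap, alap, delay, demand, d):
    """Start time in [asap, alap] with the smallest scheduling cost."""
    return min(range(asap, alap + 1),
               key=lambda t: scheduling_cost(node, t, delay, demand, d))

def commit(node, t, delay, demand, d):
    """Mark the slots taken by the scheduled node as fully demanded."""
    pw = [Fraction(0)] * d
    for k in range(delay[node]):
        pw[(t + k) % d] = Fraction(1)
    demand[node] = pw

# Hypothetical example with period d = 4: x is unscheduled, y is being placed.
d = 4
delay = {"x": 1, "y": 2}
demand = {"x": wrapped_demand(delay["x"], 0, 1, d)}
start_y = least_impact_start("y", 0, 2, delay, demand, d)
commit("y", start_y, delay, demand, d)
```

Here y is steered away from the two slots that x may still need, which leaves both of x's feasible schedules open.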
B. Processor Allocation
Once a periodic schedule is obtained, we need to assign tasks to processors such that the number of processors used in the implementation is minimized. The problem of finding a minimum-processor allocation for periodic schedule is equivalent to the circular-arc coloring problem [26] .
Two tasks are incompatible if and only if their corresponding circular arcs in the processor request profile overlap. Two incompatible tasks cannot be assigned to the same processor. This relationship can be represented by an undirected graph, called the incompatibility graph. Its nodes represent the nodes of the ICDG, and two nodes are connected if and only if their corresponding tasks are incompatible. In this paper, an efficient heuristic algorithm which was designed for the digital system test scheduling problem [7] is employed. We will only illustrate the algorithm by an example. Readers are referred to [7] for more details of the algorithm.
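A minimal Python sketch of this allocation step is given below. It builds the incompatibility graph from circular arcs, orders the nodes by repeatedly picking the node of largest remaining degree and deleting its edges, and then gives each node the lowest-numbered color not used by its neighbors. This follows the procedure illustrated by the example in this section rather than the full algorithm of [7]; the function names and the task set are hypothetical.

```python
def arcs_overlap(a, b, d):
    """True if two circular arcs (start, length) share a time slot modulo d."""
    slots_a = {(a[0] + k) % d for k in range(a[1])}
    slots_b = {(b[0] + k) % d for k in range(b[1])}
    return bool(slots_a & slots_b)

def incompatibility_graph(tasks, d):
    """tasks maps a node to (start, length); returns adjacency sets."""
    graph = {n: set() for n in tasks}
    names = list(tasks)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if arcs_overlap(tasks[u], tasks[v], d):
                graph[u].add(v)
                graph[v].add(u)
    return graph

def coloring_order(graph):
    """Repeatedly pick the node of largest remaining degree and drop its edges."""
    remaining = {n: set(nb) for n, nb in graph.items()}
    order = []
    while remaining:
        node = max(remaining, key=lambda n: len(remaining[n]))
        order.append(node)
        for nb in remaining.pop(node):
            remaining[nb].discard(node)
    return order

def assign_processors(graph):
    """Give each node the lowest color (processor) unused by its neighbors."""
    color = {}
    for node in coloring_order(graph):
        used = {color[nb] for nb in graph[node] if nb in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color

# Hypothetical periodic schedule with d = 6; task D wraps around the period.
tasks = {"A": (0, 2), "B": (1, 2), "C": (3, 2), "D": (5, 2)}
graph = incompatibility_graph(tasks, 6)
assignment = assign_processors(graph)
```

Greedy coloring of this kind is a heuristic, so it is not guaranteed to reach the minimum number of processors on every circular-arc instance.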
Let us consider the periodic schedule in Fig. 3(a) as an example. Its incompatibility graph (ICG) is shown in Fig. 8. First, we determine a coloring sequence according to the degree (number of connected nodes) of each node in the ICG. Each time, a node with the largest degree is chosen. We then delete the edges connected to that node and choose another node, until all the nodes are selected. If more than one node has the same degree, we arbitrarily pick one. The results of the example are tabulated below:

The coloring order of these nodes, in descending order, is E, F, A, B, D, and C. Next, we color the nodes according to this order. The color of a node is the lowest number that is not used by any of its neighbor nodes. The results are listed below:

In this example, our coloring algorithm (processor assignment) uses only 3 colors (processors), which is optimal.

C. Worst Case Complexity

The periodic scheduling algorithm described in Section IV-A1) and Section IV-A2) will schedule one or more nodes
V. EXPERIMENTAL RESULTS
The periodic scheduling and allocation algorithms have been implemented and tested on several benchmark examples reported in the literature [16] , [18] , [27] . The CPU time (Tcpu) reported in our results is measured on a SUN SPARC Station 10/30 with 128 Mbyte main memory.
Example 4: In this example, we compare our method with MARS, an architectural synthesis tool [27], using the ICDG depicted in Fig. 9. By applying our scheduling and allocation algorithm to the twice-unfolded ICDG, we can achieve the minimum-processor PSFS implementation. In Table III, we compare our result with that derived from MARS. We note that our method not only needs fewer unfoldings, but also uses fewer processors.
Example 5: In this example, we compare our results with those obtained using the range-chart-guided scheduling algorithm, as reported in [18]. All these examples satisfy the condition of a GPRG, hence no unfolding is needed. The results are tabulated in Table IV. The number of processors needed for each example is the same for both methods.
Example 6: In this example, we compare our results with those reported in [16], which were obtained using the cyclo-static scheduling method. Following [16], the delays of the addition and multiplication operations are set to 1 and 2 clock ticks, respectively. No unfolding is needed for any of the benchmarks, since they all satisfy the condition of a GPRG. The results are summarized in Table V. Note that in the first two benchmarks, our implementations need fewer processors than those reported in [16].
Example 7: In many DSP applications such as linear equalization and echo cancellation, higher order filters are desired. In this example, we test our algorithm on higher order digital filters. We assume the addition and multiplication take 1 and 2 clock ticks. The results are summarized in Table VI. With reasonable computing time, our algorithm produces very satisfactory results.
Example 8: In this example, the design of an encoder for adaptive differential pulse code modulation (ADPCM) [28] according to CCITT Rec. G.721 (32 kbps) is shown. Referring to Fig. 10, [...] procedure given by CCITT Rec. G.711. An estimated value of this signal, $s_e(k)$, obtained from the adaptive predictor, is subtracted from $s_l(k)$ to produce the difference signal $d(k)$. The difference signal is encoded with a 16-level quantizer, where parallel search is used, to yield the 4-bit ADPCM value $I(k)$ that is transmitted to a distant decoder. The inverse quantizer $Q^{-1}$ generates the quantized difference signal $d_q(k)$ from $I(k)$ using the scale factor $y(k)$. An adaptive predictor consisting of 2 poles and 6 zeros produces the signal estimate, thus closing the feedback loop. The adaptation algorithm for updating the coefficients of the linear predictor is shown in [...]. Table VII summarizes the numbers of different types of operations for the encoder.
In Fig. 11 , we demonstrate a PSFS implementation derived from our proposed algorithm with data sampling period equal 
VI. CONCLUSION
In this paper, we proposed a methodology to synthesize synchronous multiprocessor systems for real-time DSP algorithms. The style of periodic schedule and fully static processor assignment is adopted. After transforming the ICDG to a previously proposed GPRG, our scheme consists of two steps. We first compute a cutoff time for the periodic schedule, by which all the operations of an iteration have to be finished. A systematic way of determining a cutoff time is proposed. We showed that the cutoff time derived from our scheme guarantees the existence of the minimum-processor implementation within the interval of the cutoff time. Then, a novel heuristic method to schedule and map a GPRG onto a multiprocessor system is proposed. In experiments on an extensive set of benchmarks, our algorithm achieves processor-efficient designs at very low complexity, while still preserving the real-time constraint.
Although our assumption on interprocessor communication delay is valid if all the processors can be put on a chip, this might not be true for more complicated DSP algorithms. In [29], the authors studied the impact of the communication delay on the method proposed in this paper. A modified algorithm which takes interprocessor communication delay into account has been proposed in [30].
