This paper describes a technique for obtaining maximum-throughput schedules for iterative data-flow graphs, given a limited set of resources. The method is based on the scheduling-range chart, which displays the scheduling window of each operation. This scheduling window, or scheduling range, is defined relative to a reference operation and can be finite or infinite. The information in the scheduling-range chart is used during scheduling to optimize the sampling period. An algorithm based on this approach, for scheduling acyclic as well as cyclic data-flow graphs, is presented.
Introduction
The practical value of a digital signal processing (DSP) algorithm often increases dramatically when the algorithm is implemented in real time (e.g. in audio, radar, and sonar applications). Parallel processing is unavoidable for real-time implementations of high-speed and/or complex DSP algorithms. As a consequence, a number of multiprocessor architectures for real-time DSP have been proposed (e.g. [2,10,13]). The speed of the implementation depends not only on technology aspects but also on the efficiency of the design tools. Among these tools [1], those that schedule the tasks composing the DSP algorithm and assign them to the hardware resources play a crucial role.
For a large number of DSP algorithms without data-dependent conditionals, scheduling can be performed statically, at compile time [11]. Moreover, DSP algorithms are characterized by their infinite repetition, which allows the exploitation of not only the parallelism between operations of the same iteration (intra-iteration parallelism), but also the parallelism between operations of different iterations (inter-iteration parallelism).
In this paper, the scheduling problem is modeled in a way similar to that proposed by Schwartz in [14]. The algorithm specification is mapped onto a data-flow graph (DFG), where the nodes represent tasks or delay elements and the directed arcs represent the precedence relations. Associated with each task $T$ there is a processing time $t_{c_T}$. In this way, it is possible to model acyclic as well as cyclic graphs. Figure 1 shows the cyclic DFG of a second-order digital filter section. The DFG, together with the set of hardware resources (which for simplicity are considered to be a set of identical processors), provides the model for the scheduling problem. Interprocessor communication delays are assumed to be negligible with respect to the computation time. The (processor) scheduling of iterative DFG's consists of finding a suitable periodic ordering of the tasks and their assignment to the set of processors such that the precedence relations are not violated and a certain objective function is optimized. When the main interest is speed, i.e. high sampling rates, the scheduling goal is to minimize the sampling period. In addition, there can be secondary goals, such as minimizing the number of processors or the register lifetimes.
Hierarchy is provided by the DFG model since the nodes can represent operations of any type of complexity, from indivisible (atomic) operations (fine grain) up to complex DFG's (coarse grain). The parallelism present in the DFG increases as the grain size decreases; therefore, the atomic DFG shows the maximum parallelism present in the algorithm. Without loss of generality, the computational nodes in this paper will be assumed atomic and non-preemptive.
In [6,9] a scheduling algorithm is discussed that minimizes the resources for optimal sampling rate. Here an algorithm will be presented that optimizes the sampling rate, for a fixed number of resources. The algorithm decisions are guided by the information displayed in the scheduling-range chart.
Maximum-Throughput Schedules of Iterative DFG's
Schedules that achieve the maximum speed for a given number of resources are called maximum-throughput schedules. For repetitive DSP algorithms, maximizing the speed, or sampling rate, amounts to minimizing the sampling period $T_0$.
In acyclic DFG's, pipelining can be used to increase parallelism and to reduce the sampling period $T_0$ down to the bound imposed by the limited hardware resources. For unlimited resources, the speed bound theoretically¹ becomes infinite².
The speed bound for cyclic DFG's depends not only on the number and computational delay of the hardware resources, but also on the topology of the graph. The minimum sampling period $T_{0\min}$ of a cyclic DFG is determined by its critical loop [12]. The critical loop is the directed loop for which the sum of the processing delays of the operations in the loop, divided by the number of delay elements in the loop, $n_l$, is maximal. This quotient is $T_{0\min}$.
Expressed in another way,
$$T_{0\min} = \max_{l \,\in\, \text{loops}} \left( \frac{\sum_{T_i \in l} t_{c_{T_i}}}{n_l} \right)$$
Schedules that can be executed at this lower bound are called rate optimal. In the DFG of the second-order digital filter section given in Figure 1, the critical loop is the one formed by nodes 4-2-6, and $T_{0\min} = 3$ TU's.
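As an illustration of this quotient, the following minimal sketch evaluates the bound for a list of loops that is supplied explicitly. The function name and the loop data are hypothetical (they are not taken from Figure 1), and enumerating the loops of a real DFG is outside the scope of the sketch.

```python
from fractions import Fraction

def iteration_period_bound(loops):
    """Largest ratio (sum of processing delays) / (number of delay elements).

    `loops` is a list of (delays, n_delay_elements) pairs, one per directed
    loop; every loop is assumed to contain at least one delay element.
    """
    if not loops:
        return Fraction(0)  # acyclic graph: no loop constrains the period
    return max(Fraction(sum(delays), n) for delays, n in loops)

# Hypothetical loops built from unit-delay operations.
example_loops = [
    ([1, 1, 1], 1),     # three operations, one delay element
    ([1, 1], 1),        # two operations, one delay element
    ([1, 1, 1, 1], 2),  # four operations, two delay elements
]
t0_min = iteration_period_bound(example_loops)
print(t0_min)  # 3
```

Exact fractions are used so that ties between loops are not disturbed by floating-point rounding.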
The Scheduling-Range Chart
Given a reference operation whose position in the time schedule (assignment of the operation to specific, repetitive timeslot(s)) is fixed, it is possible to determine the relative allowed positions of the rest of the unscheduled operations. The set of positions where an operation can be scheduled is called the scheduling range. The scheduling-range chart [6] displays this information for every operation in the DFG.
The difference between the length of the scheduling range of an operation and its duration is called its mobility. The scheduling range of an operation of an iterative DFG can be finite or infinite. When a limit of the scheduling range has a well-defined position, i.e. when all the predecessors or successors of the operation have been assigned to a specific repetitive time slot (time schedule), it is called a fixed limit. The notation used in this paper is shown in Figure 2.
¹ Communication delays are not being considered.
² The sampling period can be made shorter than the duration of the longest atomic operation by using direct blocking [8, 15].
To build the scheduling-range chart, one operation is taken as the reference. This operation has no mobility. The scheduling ranges of the remaining operations in the graph are determined by the combined effect of the following constraints:
Forward precedence relation between operations. An operation cannot start its execution before all its predecessors have finished their execution.
Backward precedence relation between operations related by registers. Operations that lie on a common path and are separated by a node representing one or more delay elements should be scheduled such that the value stored in a register is not overwritten before the successor operation(s) has (have) read it. Expressed in another way: let $T_i$ be an operation whose result is written to node $D_m$, a node representing a number $m$ of delay elements, and let $T_j$ be an operation that reads a value from $D_m$. Then $T_i$ and $T_j$ should be scheduled such that the following relation is respected: where $t_{T_k}$ is the time at which operation $T_k$ starts its execution, $t_{c_{T_k}}$ is the execution time of task $T_k$, and $T_0$ is the iteration period. This is shown in Figure 3(b).
Slack time.
This is a consequence of the backward precedence relation among operations that belong to the same loop. An operation in (a) loop(s) has a limited mobility (with respect to a reference operation in the loop) equal to the shortest slack time of the loops of which it is a member. The slack time of a loop L is defined by:
[...] time scheduled, the scheduling range has a fixed limit for operations 3, 9, and 10 (whose only predecessor is 2) and for operation 1 (whose only successor is 2).
The properties of the scheduling-range chart are discussed in more detail in [6]. There it is shown that any schedule that satisfies the backward and forward precedence relations automatically satisfies the slack-time constraint in loops.
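To make the notion of a scheduling range concrete, the sketch below computes start-time limits relative to a reference operation by a longest-path computation. It rests on an assumption that is not taken from the text: every arc from $T_i$ to $T_j$ passing through $m$ delay elements is taken to impose $t_{T_j} - t_{T_i} \ge t_{c_{T_i}} - m T_0$, which reduces to the forward precedence relation for $m = 0$. The function name, the data layout and the example graph are hypothetical. The limits returned are limits on the start time; the range drawn in a chart would extend to the latest start plus the duration.

```python
import math

def scheduling_ranges(durations, arcs, t0, ref):
    """Start-time limits of every operation, with `ref` fixed at time 0.

    durations: {op: processing time}
    arcs: list of (src, dst, m), m = number of delay elements on the arc
    Assumed constraint per arc: start[dst] - start[src] >= durations[src] - m * t0
    """
    ops = list(durations)
    neg_inf = -math.inf
    # longest[a][b] = tightest known lower bound on start[b] - start[a]
    longest = {a: {b: (0 if a == b else neg_inf) for b in ops} for a in ops}
    for src, dst, m in arcs:
        weight = durations[src] - m * t0
        longest[src][dst] = max(longest[src][dst], weight)
    # Floyd-Warshall variant for longest paths; meaningful only when no loop
    # has positive total weight, which is checked afterwards.
    for k in ops:
        for a in ops:
            for b in ops:
                via = longest[a][k] + longest[k][b]
                if via > longest[a][b]:
                    longest[a][b] = via
    if any(longest[a][a] > 0 for a in ops):
        raise ValueError("t0 is too small: some loop has negative slack")
    # Lower limit from paths ref -> op, upper limit from paths op -> ref;
    # a missing path leaves that side of the range unbounded.
    return {op: (longest[ref][op], -longest[op][ref]) for op in ops}

# Hypothetical graph: loop A -> B -> (one delay) -> A, plus C fed by A.
ranges = scheduling_ranges(
    {"A": 1, "B": 1, "C": 1},
    [("A", "B", 0), ("B", "A", 1), ("A", "C", 0)],
    t0=3,
    ref="A",
)
print(ranges["B"])  # (1, 2)
```

In this example operation B, which shares a loop with the reference, has a finite range, whereas C has no upper limit because no path leads from C back to the reference.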
Maximum-Throughput Scheduling with Limited Resources
When both the sampling period and the number of resources are fixed, as is done in [14], a solution to the scheduling problem may not exist [6]. To guarantee a solution, one of the two parameters should be left free to be optimized by the algorithm. In [6,9] a scheduling algorithm is discussed that minimizes the resources for optimal sampling rate. Here an algorithm will be presented that optimizes the sampling rate, for a fixed number of resources. The algorithm decisions are guided by the information displayed in the scheduling-range chart.
Since the schedule needs to be periodic in $T_0$, the overlap in time of the execution of operations belonging to different iterations must be taken into account. This is done by folding the execution of one iteration by modulo [...] is NP-complete, can be reduced to an instance of the iterative scheduling problem.
Computational Iteration Period Bound
We define the computational iteration period bound, $T_{0b}$, a lower bound imposed by the computational delays of the nodes in the DFG and the number of processors, $P$ (the topology of the graph is not considered), as:
$$T_{0b} = \left\lceil \frac{\sum_{T_i \in G} t_{c_{T_i}}}{P} \right\rceil$$
where $G$ is the DFG and $t_{c_{T_i}}$ is the execution time of task $T_i$. Consider the small acyclic DFG of Figure 5. The three operations have a duration of 1 TU. The respective computational iteration period bounds for $P = 1$, $P = 2$, and $P = 3$ are given by:
$$T_{0b}\big|_{P=1} = 3, \qquad T_{0b}\big|_{P=2} = 2, \qquad T_{0b}\big|_{P=3} = 1$$
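The bound is easy to evaluate numerically. The sketch below assumes that time is counted in whole time units, so the quotient of total processing time and number of processors is rounded up; this rounding and the function name are assumptions of the sketch.

```python
def computational_period_bound(durations, processors):
    """Total processing time divided by the number of processors, rounded up.

    Rounding up reflects the assumption that the iteration period is a whole
    number of time units.
    """
    total = sum(durations)
    return -(-total // processors)  # integer ceiling division

# Three unit-length operations, as in the small acyclic example.
unit_ops = [1, 1, 1]
bounds = {p: computational_period_bound(unit_ops, p) for p in (1, 2, 3)}
print(bounds)  # {1: 3, 2: 2, 3: 1}
```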
The Scheduling Algorithm
The scheduling algorithm introduced in this section is valid for cyclic as well as acyclic DFG's; acyclic DFG's are treated as a special case of cyclic ones.
Cyclic DFG's have a minimum sampling period with unlimited resources, $T_{0\min}$, imposed by the topology of the graph and the hardware constraints [12]. For acyclic DFG's, the minimum sampling period with unlimited resources, $T_{0\min}$, is 0. With fixed resources, the sampling period $T_0$ has a lower bound given by $T_{0b}$.
So, with the number of processors fixed at $P$, the lower bound for $T_0$ is given by:
$$T_0 \ge \max(T_{0b}, T_{0\min})$$
The description of the algorithm is given below:
1. Determine the computational iteration period bound, $T_{0b}$.
2. If the DFG is cyclic, determine the critical loop and $T_{0\min}$ (e.g. by the method proposed in [5]); otherwise, make $T_{0\min} = 0$.
3. Make $T_0 = \max(T_{0b}, T_{0\min})$.
4. Select a reference operation, and determine the scheduling-range chart for $T_0$.
5. Partition the scheduling-range chart into sectors that correspond to the equivalence classes, consecutively labeled from 0 to $T_0 - 1$.
6. Maintain for each equivalence class a pointer to the first available level; initially, this pointer has the value 1 for all equivalence classes.
7. Select the unscheduled operation with the shortest scheduling range. In case of equal length, give priority to an operation whose range has a fixed limit. If several operations have a range with a fixed limit, select the one with the lowest label.
8. Assign the selected operation to the equivalence class(es) that has (have) the lowest first available level(s). Only those equivalence classes that are covered by the scheduling range of the operation are considered. Since the level of a class is not associated with any processor, operations that cover more than one class can be assigned to different levels.
When the operation cannot be assigned to the available empty classes, repeat the procedure for iteration period adjustment described in Section 4 until the operation has enough contiguous available classes to be assigned.
9. Update the scheduling-range chart to incorporate the fixed position of the operation just scheduled. Update the first available level of the affected equivalence classes.
10. Repeat steps 7, 8 and 9 until all operations have been scheduled.
11. Assign operations to processors.
(a) Sort the operations according to their computational delay, longest first; in case of equal duration, sort by label.
(b) Maintain for each equivalence class a pointer to its first available level, which is now associated with a processor. This pointer should be initialized to 1 for all equivalence classes.
(c) Remove the first operation from the list and assign it to the first processor level in which all the equivalence classes to which the operation has been assigned are empty.
When there is no processor that fulfills this condition, repeat the procedure for iteration period adjustment described in Section 4 until the assignment of the operation is possible.
(d) Update the first-available-level pointers.
Postponing the processor-assignment phase, as described by the above algorithm, is not necessary when all the operations have unit length: each level in the equivalence classes can then be directly associated with a processor. Figure 6 shows, step by step, the scheduling of the three unit-length operations of the small acyclic example of Figure 5, for $P = 2$ and $P = 3$ processors.
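The processor-assignment phase (step 11) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it records occupancy in a processor-by-class table instead of first-available-level pointers, it assumes that every operation is no longer than $T_0$, and it only signals the need for an iteration period adjustment rather than performing one. All names and the example data are hypothetical.

```python
def assign_processors(ops, t0, num_processors):
    """Assign class-scheduled operations to processors, longest first.

    ops: {label: (duration, first_class)}; an operation occupies `duration`
    consecutive equivalence classes starting at `first_class`, modulo t0.
    Returns {label: processor index}; raises RuntimeError when an operation
    fits on no processor (the case that calls for a period adjustment).
    """
    # busy[p][c] is True when processor p is occupied in class c
    busy = [[False] * t0 for _ in range(num_processors)]
    # (a) longest operations first, ties broken by label
    order = sorted(ops, key=lambda label: (-ops[label][0], label))
    assignment = {}
    for label in order:
        duration, first = ops[label]
        classes = [(first + k) % t0 for k in range(duration)]
        # (c) take the first processor on which all needed classes are empty
        for p in range(num_processors):
            if not any(busy[p][c] for c in classes):
                for c in classes:
                    busy[p][c] = True  # (d) record the new occupancy
                assignment[label] = p
                break
        else:
            raise RuntimeError(f"operation {label} needs a period adjustment")
    return assignment

# Hypothetical class assignment for T0 = 3 and two processors.
example = {1: (2, 0), 2: (1, 2), 3: (2, 1), 4: (1, 0)}
print(assign_processors(example, t0=3, num_processors=2))
# {1: 0, 3: 1, 2: 0, 4: 1}
```

Handling the long operations first leaves the unit-length ones to fill the remaining gaps, which is the motivation for sorting by duration.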
Iteration Period Adjustment
Fixing the iteration period $T_0$ and the number of processors $P$, as proposed by Schwartz in [14], does not always lead to a solution [9]. The rate-optimal scheduling algorithms proposed in [SI guarantee a solution by fixing $T_0$ and optimizing the number of resources. If we define the computational processor bound, $P_{0b}$, as:
then, those algorithms will provide a solution in a number of processors P such that:
In this paper, the number of processors is fixed and the algorithm optimizes the iteration period. A solution is guaranteed with an iteration period given by:
$$T_0 \ge \max(T_{0b}, T_{0\min}) \qquad (8)$$
The reasons why fixing both $T_0$ and $P$ may not lead to a solution are the following:
• The precedence constraints in the DFG can be such that the equality in Equation 8 cannot be achieved by any schedule on $P$ processors.
An example of such a DFG is given in Figure 9(a); $T_{0\min} = 4$ TU's and $P_{0b} = 2$. The initial state of the scheduling-range chart is shown in Figure 9(b). The precedence constraints are such that the only operation with mobility is 7; the rest have a fixed position. There are only two possible class assignments for operation 7 (its duration is 2 TU's and, due to the non-preemption condition, it should be assigned to consecutive classes): (0-1) and (1-2). However, only classes (2-3) are available. A solution with $P_{0b}$ processors in $T_{0\min}$ does not exist; either the number of processors or the iteration period has to be increased.
• The use of a heuristic cannot guarantee an optimal solution.
The iterative scheduling problem is NP-complete. The use of heuristics, such as those proposed in this paper, to find a schedule in polynomial time cannot guarantee a solution in the minimum sampling period, as given by the equality in Equation 8.
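The conflict described for operation 7 can be checked mechanically. The sketch below lists the runs of consecutive classes inside a scheduling range that are entirely free. The function name is hypothetical, and the range covering classes 0, 1 and 2 is inferred from the two placements named in the example.

```python
def feasible_placements(range_classes, duration, free_classes):
    """Runs of `duration` consecutive classes in the range that are all free.

    range_classes: ordered list of the classes covered by the scheduling range
    free_classes: set of classes that still have an available level
    """
    runs = [
        tuple(range_classes[i:i + duration])
        for i in range(len(range_classes) - duration + 1)
    ]
    return [run for run in runs if all(c in free_classes for c in run)]

# Operation 7: duration 2, range over classes 0..2, only classes 2 and 3 free.
print(feasible_placements([0, 1, 2], 2, {2, 3}))        # []
# With every class free, the two placements named in the text appear.
print(feasible_placements([0, 1, 2], 2, {0, 1, 2, 3}))  # [(0, 1), (1, 2)]
```

An empty result corresponds to the situation in which either the number of processors or the iteration period has to be increased.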
Therefore, when the number of processors is fixed at $P$, the scheduling algorithm should include some mechanism to increase $T_0$ when necessary, in order to guarantee a solution. The adjustment of $T_0$ can be necessary during the class-assignment phase and/or during the processor-assignment phase.
The procedure for adjustment of the iteration period consists of the following steps:
1. Introduce a new equivalence class within the scheduling range of the conflicting operation (the operation that cannot be scheduled or assigned to any processor), as shown in Figure 11.
2. Make $T_0 = T_0 + 1$.
3. Update the scheduling-range chart for the inclusion of the new equivalence class and the increase in the sampling period.
Figure 12 shows the iteration-period adjustment during the equivalence-class assignment phase. The rectangles at the bottom show the equivalence-class assignment of operations.
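The bookkeeping behind steps 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: classes at or after the insertion point are renumbered by one, the insertion point does not fall inside an operation that has already been placed across several classes, and the update of the scheduling-range chart itself (step 3) is not modeled. The function name and the data layout are hypothetical.

```python
def insert_equivalence_class(levels, assignment, position):
    """Insert an empty equivalence class and renumber the later classes.

    levels: first available level per class (list of length T0)
    assignment: {operation: list of classes it occupies}
    position: index at which the new, empty class is inserted
    Returns (new_levels, new_assignment, new_t0).
    """
    # The new class is empty, so its first available level is 1.
    new_levels = levels[:position] + [1] + levels[position:]
    # Assumption: operations in classes at or after `position` move up by one.
    new_assignment = {
        op: [c + 1 if c >= position else c for c in classes]
        for op, classes in assignment.items()
    }
    return new_levels, new_assignment, len(new_levels)

# Hypothetical state: T0 = 2, one operation in each class.
levels, assignment, t0 = insert_equivalence_class(
    [2, 2], {"a": [0], "b": [1]}, position=1
)
print(levels, assignment, t0)  # [2, 1, 2] {'a': [0], 'b': [2]} 3
```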
Time-Complexity Issues
In order to determine the overall time complexity of the algorithm, the following main steps have to be considered:
1. The computation of $T_{0\min}$. This value is necessary for any scheduling algorithm for iterative cyclic DFG's, not only the one presented in this paper. In [...] of edges in the DFG.
2. The generation of the inequalities that express the forward and backward precedence relations. There is at most one forward and one backward precedence relation for each pair of computational nodes. So, the number of relations is $O(c^2)$, where $c$ is the number of computational nodes in the DFG.
3. The combination of all inequalities to find the scheduling range for each operation with respect to a reference operation. In [5] it is explained how the Fourier-Motzkin elimination method [3] can be adapted to operate in polynomial time for the special type of inequalities occurring here. The result is an $O(c^3)$ algorithm.
4. The updating of the scheduling ranges. Each time an operation is given a fixed position within its range (there are at most $T_0$ choices), the ranges of the other operations can be updated in $O(c^2)$ time. So, as $c$ operations have to be scheduled, this step needs $O(cT_0 + c^3)$ time.
5. The adjustment of the iteration period. In the equivalence-class assignment phase this requires $O(cT_0)$ time and in the processor-assignment phase $O(cPT_0)$ time. It is assumed that the number of adjustments is a small number that does not grow with the problem size.
So, the total time complexity of the algorithm is $O(de + d^4 + c^3 + cPT_0)$.
Conclusions
An algorithm based on an alternative scheduling approach for iterative acyclic and cyclic DFG's with limited resources, which exploits inter- and intra-iteration parallelism, has been presented. The proposed method guides the scheduling algorithm with the information supplied by a scheduling-range chart.
Giving priority in the selection of operations according to mobility and to ranges with fixed limits is intended to maximize the scheduling range of the operations that are still unscheduled.
For cases where the precedence constraints do not allow a schedule in the originally selected optimal sampling period $T_0$, the algorithm provides an adjustment procedure, so that a solution is always guaranteed.
Postponing the processor-assignment phase increases the efficiency of the algorithm when the operations are non-preemptive and have processing times other than unity. The algorithm provides a procedure for the adjustment of $T_0$ for those cases where a solution requires more processors than the number of levels of the equivalence classes [7].
