This paper presents a new method for rate-optimal scheduling of recursive DSP algorithms. The approach is based on the determination of the scheduling window of each operation and the construction of a "scheduling-range chart". The information in this chart is used during scheduling to optimize several quality criteria (number of hardware resources, latency, register life time) while a rate-optimal solution is guaranteed. An algorithm based on this approach is introduced. It can schedule cyclic as well as acyclic data flow graphs. The algorithm is powerful enough to solve optimally some problems for which other proposed methods fail.
Introduction
As a consequence of the complexity and speed of DSP algorithms, implementations that exploit parallelism are becoming more and more popular. This requires the support of suitable design tools. One of these should perform the scheduling and assignment of the operations to the hardware resources, which, for a large number of DSP algorithms, can be performed statically [5].
The algorithm description given as a signal flow graph can be mapped onto a data flow graph (DFG), where the nodes represent tasks or delay elements and the directed arcs represent the precedence relations. Associated with each task T_i there is a processing time, t_{cT_i}. Figure 1 shows an example of a cyclic DFG. The DFG, together with the set of hardware resources, which for simplicity will be considered as a set of identical processors, provides the model for the scheduling problem. In order to be able to exploit all the existing parallelism, the tasks will be assumed to be atomic. Communication delays are not considered.
When the main interest is speed, the scheduling goal is to minimize the sampling period. Without pipelining, the minimum sampling period for execution of an iterative acyclic DFG is limited by the critical path, provided that enough computing power is available.
Similarly, for iterative cyclic DFG's there is a lower bound for the sampling period, imposed not by the critical path, but by the critical loop [10]. This bound is called the iteration period bound and is denoted by T_{0min}. Schedules that can be executed at this sampling period are called rate-optimal schedules. Basically, three scheduling techniques that provide such schedules have been reported: maximum spanning tree [10, 11], search of cyclo-static schedules [11, 12], and optimum unfolding [7].
The first has the disadvantage that it does not try to optimize the number of processors, leading in general to solutions with suboptimal processor utilization [3]. The second consists of a depth-first search of cyclo-static solutions, where the schedule is periodic not only in time but also in the processor space. In order to limit the search space, only solutions that are rate-optimal and require a number of processors equal to the processor bound, i.e. the minimum computing power necessary to execute the graph at optimal rate, are allowed. So, the solutions that this method provides are always processor-optimal when there exists a solution that can achieve such a bound. Otherwise, after exponential time with respect to the number of nodes, the algorithm finishes the search unsuccessfully. The third method consists of reducing the graph to an equivalent perfect-rate data-flow program that can always be scheduled rate-optimally. This is performed by unfolding with an optimal unfolding factor. The main disadvantage of unfolding is that the memory space necessary to store the program increases proportionally with the unfolding factor.
This paper presents an alternative approach for the rate-optimal scheduling of cyclic DFG's based on the determination of a scheduling-range chart. The information in this chart is used during scheduling to optimize several quality criteria (number of hardware resources, latency, register life time) while a rate-optimal solution is guaranteed.
This research has been partially financed by the Foundation FOM ("Innovative CAD Tools for IC Design", TEL 330408).
CH2868-8/90/0000-1805$1.00 © 1990 IEEE
The Scheduling Range
Non-Iterative Acyclic DFG's
For a non-iterative acyclic DFG (or an iterative acyclic DFG without use of pipelining), an optimal schedule has the length of the critical path. Operations in the critical path cannot be deferred without affecting the length of the schedule. They do not have any mobility and have to be scheduled in fixed time slots. Operations that do not belong to the critical path can be scheduled in a range of time slots, the scheduling range. In an acyclic DFG, the scheduling range of an operation is finite and determined by its positions in the early and late schedules [9, 8]. For an operation in the critical path, the early- and late-schedule positions are the same.
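The early and late schedules mentioned above can be obtained with one forward and one backward pass over the graph. The following Python sketch is only an illustration under stated assumptions: the function name `scheduling_ranges`, the dictionary-based graph encoding, and the four-operation example graph are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def scheduling_ranges(durations, edges):
    """Early/late start times for an acyclic DFG (hypothetical helper).

    durations: {operation: execution time}
    edges:     list of (predecessor, successor) pairs
    Returns ({operation: (earliest_start, latest_start)}, schedule_length).
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Topological order (Kahn's algorithm).
    indeg = {op: len(preds[op]) for op in durations}
    ready = [op for op in durations if indeg[op] == 0]
    order = []
    while ready:
        op = ready.pop()
        order.append(op)
        for s in succs[op]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # Early schedule: start as soon as all predecessors have finished.
    early = {}
    for op in order:
        early[op] = max((early[p] + durations[p] for p in preds[op]), default=0)
    length = max(early[op] + durations[op] for op in durations)  # critical path

    # Late schedule: start as late as the successors allow.
    late = {}
    for op in reversed(order):
        late[op] = min((late[s] for s in succs[op]), default=length) - durations[op]

    return {op: (early[op], late[op]) for op in durations}, length

# Assumed example graph: A -> C, B -> C, C -> D.
ranges, length = scheduling_ranges(
    {"A": 2, "B": 1, "C": 2, "D": 1},
    [("A", "C"), ("B", "C"), ("C", "D")],
)
for op in sorted(ranges):
    e, l = ranges[op]
    print(op, "earliest", e, "latest", l, "mobility", l - e)
```

In this assumed example, A, C, and D lie on the critical path (equal early and late positions), while B can start at time 0 or 1.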
Based on information provided by the scheduling range, various scheduling algorithms for non-repetitive acyclic DFG's have been proposed. Examples are PERT [6], the scheduling strategies proposed by Ramamoorthy et al. in [9], and the critical path algorithm for microcode compaction (footnote 1) proposed by Ramamoorthy and Tsushita [4]. Another example is the recently proposed force-directed scheduling [8] for behavioral synthesis of ASIC's, which has been reported as very successful.
Iterative DFG's
In iterative DFG's it is possible to exploit not only the parallelism between operations of the same iteration, but also that between operations of different iterations. When the DFG is acyclic, pipelining can be used to reduce the sampling period T_0, which, for unlimited resources and neglecting the communication delays, can be made arbitrarily small (footnote 2).
The speed bound for cyclic DFG's depends not only on the number and computational delay of the hardware resources, but also on the topology of the graph. The minimum sampling period T_{0min} is determined by the critical loop [10] of the DFG.
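As an illustration of how the critical loop fixes the bound, the sketch below assumes the usual definition of the iteration period bound from the literature: the maximum, over all loops, of the total computation time in the loop divided by the number of delay elements in the loop. This formula is not spelled out in the text above, and the function name, the edge encoding `(source, destination, number_of_delays)`, and the example graph are hypothetical.

```python
from fractions import Fraction

def iteration_period_bound(durations, edges):
    """Max over loops of (sum of execution times) / (number of delays).

    durations: {node: integer execution time}; node labels must be comparable.
    edges:     list of (src, dst, n_delays) with integer n_delays >= 0.
    Simple loops are enumerated by brute force (fine for small graphs only).
    """
    succs = {}
    for u, v, n in edges:
        succs.setdefault(u, []).append((v, n))
    best = Fraction(0)

    def dfs(start, node, time, delays, visited):
        nonlocal best
        for nxt, n in succs.get(node, []):
            if nxt == start:  # loop closed
                total = delays + n
                if total == 0:
                    raise ValueError("loop without delay elements")
                best = max(best, Fraction(time, total))
            elif nxt not in visited and nxt > start:
                # Only visit nodes larger than the start node, so every
                # simple loop is found once, from its smallest node.
                dfs(start, nxt, time + durations[nxt], delays + n, visited | {nxt})

    for s in sorted(durations):
        dfs(s, s, durations[s], 0, {s})
    return best

# Assumed example: loop 1-2-3 (time 4, one delay) and loop 2-4 (time 3, two delays).
durations = {1: 1, 2: 1, 3: 2, 4: 2}
edges = [(1, 2, 0), (2, 3, 0), (3, 1, 1), (2, 4, 0), (4, 2, 2)]
print(iteration_period_bound(durations, edges))
```

In the assumed example the loop 1-2-3 is the critical loop, since 4/1 exceeds 3/2.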
The Scheduling-Range Chart
Given a reference operation whose position in the time schedule (assignment of the operation to specific, repetitive time slot(s)) is fixed, it is possible to determine the relative allowed positions of the remaining unscheduled operations. The set of positions where an operation can be scheduled is called the scheduling range. The scheduling-range chart displays this information for every operation in the DFG. The difference between the length of the scheduling range of an operation and its duration is called its mobility. The scheduling range of an operation of an iterative DFG can be finite or infinite. When a limit of the scheduling range has a well-determined position, i.e. when all the predecessors or successors of the operation have been scheduled, it is called a fixed limit. The notation used in this paper is shown in Figure 2.
To build the scheduling-range chart, an operation is taken as a reference. This operation has no mobility. The scheduling range of the rest of the operations in the graph is determined by (the joint effects of) the following constraints:
The precedence relation between operations. An operation cannot start its execution before all its predecessors have finished their execution. See Figure 3(a).
Footnote 1: Do not confuse with the "critical path list scheduling" algorithm.
Footnote 2: The sampling period can be made shorter than the duration of the longest atomic operation by using direct blocking [3, 11] or cyclo-static techniques [11, 12].
Backward precedence relation between operations related by registers. Operations that are in a common path and separated by a node representing one or more delay elements should be scheduled such that the value stored in a register is not rewritten before the successor operation(s) has (have) read it. Expressed in another way: let T_i be an operation whose result is written to node D_l, a node representing a number n of delay elements. Let T_j be an operation that reads a value from D_l. Then T_i and T_j should be scheduled such that the following relation is respected:
t_{T_i} + t_{cT_i} \le t_{T_j} + n T_0    (1)
where t_{T_k} is the time when operation T_k starts its execution, t_{cT_k} is the execution time of task T_k, and T_0 is the iteration period. This is shown in Figure 3(b).
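Relation (1) can be checked mechanically for a candidate pair of start times. The short Python sketch below is only an illustration; the function names and the numeric example are hypothetical, and time is assumed to be measured in integer time units.

```python
def satisfies_relation_1(start_i, exec_i, start_j, n_delays, period):
    """True if t_Ti + t_cTi <= t_Tj + n * T_0, i.e. relation (1) holds."""
    return start_i + exec_i <= start_j + n_delays * period

def latest_start(exec_i, start_j, n_delays, period):
    """Latest start time of T_i allowed by relation (1) once T_j is fixed."""
    return start_j + n_delays * period - exec_i

# Assumed numbers: T_j starts at time 0, one delay element, T_0 = 3,
# and T_i takes 1 time unit.
print(latest_start(exec_i=1, start_j=0, n_delays=1, period=3))
print(satisfies_relation_1(2, 1, 0, 1, 3))
print(satisfies_relation_1(3, 1, 0, 1, 3))
```

With these assumed numbers, T_i may start no later than time 2; starting at time 3 violates the relation.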
Slack time in loops. This is a consequence of the backward precedence relation among operations that belong to the same loop [2]. An operation in one or more loops has a limited mobility, equal to the shortest slack time of the loops of which it is a member. The slack time of a loop C is defined by: [...]
[...] is discussed that optimizes the sampling rate for a fixed number of resources. Here an algorithm will be presented that minimizes the number of hardware resources for a fixed sampling rate. The algorithm can be applied to cyclic as well as to acyclic DFG's.
The scheduling algorithm is guided by the information displayed by the scheduling-range chart. In order to take into account, during scheduling, the effect of the overlapping in time of the execution of operations of different iterations, the operations are grouped in T_0 [...]
[...] that has (have) the lowest first available level(s). Of course, only those equivalence classes that are covered by the scheduling range of the operation are considered. Since the level of a class is not associated with any processor, operations that cover more than one class can be assigned to different levels. Update the first available level of the affected equivalence classes.
7. Update the scheduling-range chart to incorporate the fixed position of the operation just scheduled.
8. Repeat steps 5, 6, and 7 until all operations have been scheduled.
9. Assign operations to processors.
(a) Sort operations according to their computational delay, the longest first; in case of equal duration, sort by label.
(b) Maintain for each equivalence class a pointer to its first available level, which now is associated with a processor. This pointer should be initialized to 1 for all equivalence classes.
(c) Remove the first operation from the list and assign it to the first processor level in which all the equivalence classes to which the operation has been assigned are empty. Update the first-available-level pointers.
(d) Repeat the last two steps until all the operations have been assigned to processors.
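The processor-assignment phase, steps (a) to (d), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the input encoding, and the example operations are hypothetical. Instead of the per-class pointers of step (b), the sketch keeps, for each processor, the set of equivalence classes already occupied.

```python
def assign_processors(ops):
    """Assign time-scheduled operations to processors (hypothetical helper).

    ops: {label: (duration, classes)}, where classes lists the equivalence
         classes occupied by the operation in the time schedule.
    Returns {label: processor number, starting at 1}.
    """
    # (a) Longest operation first; ties are broken by label.
    order = sorted(ops, key=lambda label: (-ops[label][0], label))

    busy = []        # busy[p] = classes already occupied on processor p + 1
    assignment = {}
    for label in order:
        classes = set(ops[label][1])
        # (c) First processor on which all needed classes are still empty.
        for p, used in enumerate(busy):
            if not (used & classes):
                used |= classes          # in-place update of busy[p]
                assignment[label] = p + 1
                break
        else:
            busy.append(set(classes))    # no processor fits: open a new one
            assignment[label] = len(busy)
    return assignment

# Assumed example with T_0 = 3 (classes 0, 1, 2).
ops = {
    "a": (2, [1, 2]),
    "b": (1, [0]),
    "c": (1, [2]),
    "d": (2, [0, 1]),
}
print(assign_processors(ops))
```

In this assumed example the four operations fit on two processors, each of which is busy in all three equivalence classes.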
Figure 5 shows, step by step, the rate-optimal scheduling of the DFG of Figure 1. Operation 2, which belongs to the critical loop, is used as reference. T_0 = T_{0min} = 3 TU's, so there are three equivalence classes, 0, 1, and 2, which are assigned sequentially to sectors in the chart. After this step, operations 2 and 4, which have no mobility, have been automatically assigned to class 0 and to classes 1 and 2, respectively, in level 1 (Figure 5(a)). The next selected operation is 1. There are no more empty levels, so a new level is created. The operation is scheduled giving preference to the position where the fixed limit is, filling level 2 of class 2; the chart is updated (Figure 5(b)). The algorithm continues with 3. Of the three possible positions, (1-2), (2-0), and (0-1), the algorithm chooses the first in order not to create a new level (Figure 5(c)). The algorithm continues in this way until all the operations have been assigned to relative time slots (time schedule) and to equivalence classes (Figure 5(g)). The result of the processor assignment phase is shown in Figure 5(h); the processor utilization is 100%. Figure 5(i) visualizes the overlapping in the execution of three consecutive iterations.
Figure 6 shows the rate-optimal schedule for the same DFG example obtained by using the maximum spanning tree method [10, 11], which requires 5 processors. Figure 7(a) shows an example of a cyclic DFG that does not have a rate-optimal solution for a number of processors equal to the processor bound [11, 12], P_0 = 2. Inspection of the scheduling-range chart (Figure 7(b)) shows that the only operation that has mobility is 7. It can only be assigned to classes (1-2) or (2-3). Both possible schedules require 3 processors. Therefore, the search method proposed in [11, 12] fails in this case.
Figure 5: Scheduling and equivalence-class assignment (a) to (g). Processor assignment (h). Processor execution of three consecutive iterations (i).
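The candidate positions (1-2), (2-0), and (0-1) quoted in the walkthrough are consistent with taking time slots modulo the iteration period. The following Python sketch illustrates this reading; it assumes integer time units and that equivalence class k collects all time slots congruent to k modulo T_0, and the function name is hypothetical.

```python
def classes_covered(start, duration, period):
    """Equivalence classes (time slots modulo the iteration period)
    occupied by an operation that starts at `start` and runs for
    `duration` time units."""
    return [(start + k) % period for k in range(duration)]

# A two-time-unit operation with T_0 = 3 has three distinct placements.
for start in (1, 2, 0):
    print(start, classes_covered(start, 2, 3))
```

Python's `%` operator returns a non-negative result for a positive modulus, so start times before the reference operation (negative relative times) map to the same classes without special handling.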

Conclusions
A new approach for rate-optimal scheduling of recursive DSP algorithms has been introduced. It is based on using the information of the scheduling range of the operations, and is an attempt to over[...] be moved within their scheduling range and scheduled such that the hardware resources are optimized. An algorithm based on this approach has been proposed. The algorithm, although simple, is powerful enough to solve many problems optimally.
[...] S.M. Heemstra [...]
[5] E.A. Lee and D.G. Messerschmitt. [...]
Even more sophisticated methods, which schedule considering the effect of the scheduling of an operation on the total schedule, can be applied by converting the scheduling-range chart into a scheduling-range distribution chart, where the height of the rectangle that determines the range is determined by the probability that the operation will be scheduled in that time slot. The area should be equal to the duration of the operation. In practice, the life time of the variables should be minimized. So, the scheduling range is finite, and a modification of the force-directed method [8] can be applied.
