Numerous customary applications in digital signal processing may be characterized by synchronous data-flow graphs (SDFGs). Our previous experiments with SDFGs led to the development of a technique which constructs a static time-optimal schedule on a model system possessing limitless functional elements. In this paper, we now introduce a modification of this method for scheduling an SDFG to realize efficient performance in a more restricted environment.
Introduction
The model of the multi-rate or synchronous data-flow graph (SDFG), initially presented by Lee and Messerschmitt [1] , has long been established as a beneficial means for depicting and analysing applications in digital signal processing (DSP). An SDFG consists of vertices (or nodes) connected by edges, symbolizing data conduits joining functional units. In an SDFG, nodes use and create preset quantities of data tokens or delays during each operation. Moreover, edges may be preloaded with existing data tokens. As said, this representation is prevalent among engineers of DSP products, with many valuable outcomes concerning DSP design following from its use.
Our prior discussions of the subject [2] , [3] establish the reasons we have chosen to work exclusively with the robust SDFG model, rather than attempt to utilize simplified variations. During the course of this work, we have developed efficient algorithms for static scheduling and retiming. However, these results have been developed with the assumption of unlimited resources. While this is a very common assumption when developing such basic theory, it clearly does not represent a real-world setting. There is some work on resource-constrained scheduling and optimization of traditional data-flow graphs [4] - [6] , but our work in this paper and in [7] represents the initial investigation into effectively dealing with this issue when it arises with synchronous graphs.
In this paper, we review the fundamental definitions and results needed for describing and working with SDFGs. We then offer a polynomial-time method for scheduling the tasks of a given SDFG to complete execution in close-to-optimal time within a resource-constrained system. Finally, we apply our approach to specific SDFG instances, thereby exhibiting its value. The work described herein represents a more detailed and efficient version of our early experiments reported in [7].
Next, we review the established theory pertinent to our investigations. Subsequently, we present our scheduling algorithm as well as specific examples. To conclude, we sum up our research and identify courses for further inquiry.
Synchronous Data-Flow Graphs
We now formalize the fundamental concepts developed originally by Lee and Messerschmitt [8] for SDFGs for the sake of completeness.
Basic Definitions
As in our previous work, a multi-rate [9], regular [10] or synchronous [1] data-flow graph is mathematically defined as "a finite, directed, weighted graph G = ⟨V, E, d, t, p, k⟩ where V is the vertex set of nodes or actors, which transform input data streams into output streams; E ⊆ V × V is the edge set, representing channels which carry data streams; d : E → {0, 1, 2, 3, ...} is a function with d(e) the number of initial tokens (delays) on edge e; t : V → {1, 2, 3, ...} is a function with t(v) the execution time of node v; p : E → {1, 2, 3, ...} is a function with p(e) the number of data tokens produced at source node of e to be carried by e; and k : E → {1, 2, 3, ...} is a function with k(e) the number of data tokens consumed from e by the sink node of e" [2], [3]. An SDFG is a single-rate or homogeneous data-flow graph if p(e) = k(e) = 1 for all edges e.

Copyright © ACTA Press
Figure 1. A sample SDFG.
A sample SDFG is shown in Fig. 1 . Throughout this paper, we will use square-shaped nodes to represent addition operations requiring one time unit to complete execution, while circles represent multipliers taking time two. For instance, in the figure, t(A) = t(C) = 2 and t(B) = 1. Production and consumption rates are given as small integers at an edge's ends. For example, consider the edge connecting node A to B, hereafter denoted (A, B). The notation indicates that a pair of data tokens is generated on this channel each time A runs, while B uses one of these each time it executes. Short bar-lines dividing edges, such as that cutting (C, A) and the pair on (B, C), symbolize pre-existing tokens to be used first by the edge's sink node.
An SDFG may be associated with an |E| × |V | topology matrix whose rows and columns map to the graph's edges and vertices, respectively. The (i, j)th entry in the matrix summarizes information regarding the passage of data tokens into or out of node j on edge i during each invocation. If this entry is positive, j produces this amount of data; if negative, j consumes data from the edge; if zero, j is not incident on i. For instance, Fig. 1 has as its topology matrix:
For purposes of this representation, nodes A, B and C are designated nodes 0, 1 and 2, respectively, while the edges are numbered in the order (A, B), (B, C) and (C, A).
Lee and Messerschmitt [1] demonstrated that an SDFG G must have a topology matrix M with rank |V| − 1 in order for a repeating sequential schedule to exist. Under these circumstances, G is consistent and there exists at least one vector q in M's null space whose elements are positive integers, called a repetition vector for G. The basic repetition vector (BRV) for G [11] is then the repetition vector possessing the smallest norm. We designate the entry of the BRV q corresponding to node u with the notation q_u.
In a single-rate graph, an iteration is an execution of all nodes in the graph once. The purpose of the BRV is to specify the number of copies of the individual nodes that must execute during every static schedule iteration of a multi-rate graph. Returning to our example in Fig. 1, the depicted SDFG has a BRV of q = [1 2 1]^T, indicating that each iteration of the schedule must include the invocations of a pair of copies of B and one each of A and C.
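As an illustrative sketch, the topology matrix and the BRV can be computed mechanically from the production and consumption rates. Only the rates on (A, B) are stated explicitly in the text; the rates used below for (B, C) and (C, A) are assumptions chosen to agree with the stated BRV, and the function names are hypothetical. The balance condition q_u · p(e) = q_v · k(e) for every edge e = (u, v) is equivalent to q lying in the null space of the topology matrix.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# Edges as (source, sink, produced, consumed).
# (A, B) follows the text; the other two rows are ASSUMED rates
# that are consistent with the stated BRV q = [1, 2, 1]^T.
NODES = ["A", "B", "C"]
EDGES = [("A", "B", 2, 1), ("B", "C", 1, 2), ("C", "A", 1, 1)]

def topology_matrix(nodes, edges):
    """One row per edge, one column per node: +p(e) at the source, -k(e) at the sink."""
    index = {v: j for j, v in enumerate(nodes)}
    matrix = []
    for src, dst, produced, consumed in edges:
        row = [0] * len(nodes)
        row[index[src]] += produced
        row[index[dst]] -= consumed
        matrix.append(row)
    return matrix

def basic_repetition_vector(nodes, edges):
    """Smallest positive integer solution of the balance equations,
    or None if the graph is disconnected or inconsistent."""
    q = {nodes[0]: Fraction(1)}
    changed = True
    while changed:  # propagate the balance equation along the edges
        changed = False
        for src, dst, produced, consumed in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * produced / consumed
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * consumed / produced
                changed = True
    if len(q) != len(nodes) or any(q[s] * p != q[d] * k for s, d, p, k in edges):
        return None
    # Clear denominators, then divide out any common factor.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in q.values()), 1)
    ints = [int(q[v] * scale) for v in nodes]
    common = reduce(gcd, ints)
    return [x // common for x in ints]
```

Under these assumed rates, the sketch yields the BRV quoted above for Fig. 1; changing a single rate so that the balance equations conflict makes the function return None, which corresponds to an inconsistent graph.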
As noted in our prior work [2], [3], consistency is only the first requirement for an SDFG's static schedule to exist. The other is liveness, defined in [11] as the existence of a deadlock-free schedule. As it is not relevant to the topic at hand, we will not discuss liveness further herein. Our work in [2], [3] goes into detail on SDFG liveness tests and the interested reader is encouraged to consult these papers.
The Iteration Bound of an SDFG
As said above, an iteration is simply a single execution of all copies of all nodes of an SDFG. The iteration period of the SDFG is then the average computation time of one iteration. In synchronous graphs containing loops, this figure is bounded from below by the graph's maximum cycle mean or iteration bound [12]. We use the notation B(G) to represent G's iteration bound, derived by computing the time-to-delay ratios for all cycles in the graph and retaining the maximum of these figures. While complicated to derive for SDFGs in general, as shown in [2], we can quickly estimate this value via the inequality:
where the synchronous graph G has a basic repetition vector of [q_0, ..., q_{|V|−1}]. We will denote the fraction on the right in the above inequality as β(G) and cite it as a useful estimate for the iteration bound from now on. Returning once more to Fig. 1, the depicted graph includes one loop with an adjusted delay count of two and node computation times summing to 5; therefore β(G) = 5/2.
Static SDFG Scheduling
As shown in [2], there are two models we consider when discussing static scheduling. Formally, a function s : V × {0, 1, 2, 3, ...} → {0, 1, 2, 3, ...} is an integral schedule for an SDFG G where s(v, i) is the starting time of node v in iteration i (i ≥ 0). Likewise, a function s : V × {0, 1, 2, 3, ...} → {r ∈ Q | r ≥ 0} with the same stated properties is a fractional schedule for G. In either case, a schedule s is legal if, for each edge e = (u, v) and
. Such schedules are repeating for cycle period c if s(v, i + 1) = s(v, i) + c for all nodes v and iterations i [13] . Repeating schedules may be characterized by their initial iterations, given the fact that the full legal schedule may be constructed by commencing a new copy of the partial timetable every c clock ticks. Finally, a schedule is static or processor-static if a node's execution is designated to proceed on the same functional unit in every partial schedule instantiation [13] .
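The repeating property s(v, i + 1) = s(v, i) + c means that the first iteration determines the whole schedule. The short sketch below makes this concrete; the function name is hypothetical, and the start times and cycle period in the usage note are borrowed from the worked example later in the paper.

```python
def expand_schedule(first_iteration, cycle_period, iterations):
    """Unroll a repeating schedule: s(v, i) = s(v, 0) + i * c.

    first_iteration maps each node v to s(v, 0); the result maps
    (v, i) to the start time of node v in iteration i.
    """
    return {
        (v, i): s0 + i * cycle_period
        for v, s0 in first_iteration.items()
        for i in range(iterations)
    }
```

For instance, with first-iteration start times of 0, 2 and 2 for A, B and C and c = 4, the third iteration (i = 2) of B starts at time 10.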
As a final point, there are two design approaches to consider when devising a static schedule [13]. Our work deals with schedules following a non-pipelined implementation wherein the next occurrence of a node cannot proceed until the prior copy has terminated execution, creating an inherent precedence relation between successive incidences of the same node. Alternatively, schedules following a pipelined implementation are not subject to this limitation.
Scheduling Algorithm
Our scheduling algorithm appears as Algorithm 1. In this section, we will review each step, demonstrating its effectiveness on Fig. 1 assuming a system with one adder (requiring execution time 1 as previously noted) and one multiplier (taking two time units).
Deriving a Target Clock Period
SDFG scheduling requires specification of a clock period to attempt to achieve. In an environment without resource constraints, we could simply use the iteration-bound estimate. However, the lack of adequate resources may prevent us from achieving this optimality. The key to deriving a more easily achievable starting point lies within this result.
Lemma 3.1. Let G = ⟨V, E, d, t, p, k⟩ be an SDFG with BRV q. Define V_X ⊆ V to be the nodes of G requiring a type-X non-pipelined functional unit for execution.
Further define F_X to be the number of available type-X non-pipelined processors, each requiring time T_X ≥ 1 to perform an operation. Then the cycle period for a resource-constrained schedule of G is bounded from below by:
The result is as in [13] with the number of operations using a type-X functional unit replaced by its value in a SDFG, the sum of elements from the BRV for nodes in the SDFG using type-X functional units. When working in a resource-constrained environment, begin by selecting a target clock period larger than all of the lower bounds from Lemma 3.1 for each resource type, as well as the iteration bound. This way, we consider both the theoretically best value as well as figures that take the realities of the target system into account.
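The following sketch shows how a target clock period might be selected from these bounds. The bound used here, the sum of q_v over V_X multiplied by T_X / F_X and rounded up, is an assumption inferred from the description above and from the numerical values in the paper's examples; it is not a transcription of the lemma. The function names are hypothetical, and the sketch takes the maximum of the bounds, that is, a target at least as large as each of them.

```python
from fractions import Fraction
from math import ceil

def resource_lower_bound(q, nodes_of_type, units, op_time):
    """ASSUMED form of the bound: ceil(T_X * sum(q_v for v in V_X) / F_X)."""
    copies = sum(q[v] for v in nodes_of_type)
    return ceil(Fraction(copies * op_time, units))

def target_clock_period(iteration_bound_estimate, resource_bounds):
    """Smallest integer at least as large as every lower bound."""
    return max([ceil(iteration_bound_estimate)] + list(resource_bounds))

# Fig. 1 with one adder (time 1) and one multiplier (time 2).
q_fig1 = {"A": 1, "B": 2, "C": 1}
adder_bound = resource_lower_bound(q_fig1, ["B"], units=1, op_time=1)
multiplier_bound = resource_lower_bound(q_fig1, ["A", "C"], units=1, op_time=2)
target = target_clock_period(Fraction(5, 2), [adder_bound, multiplier_bound])
```

With the Fig. 1 data, the adder and multiplier bounds evaluate to 2 and 4, and the resulting target matches the clock period of 4 used for the scheduling graph of Fig. 2.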
As multiple copies of some nodes must be scheduled at the same time, we begin by reweighting each node's execution time, essentially distributing the copies of a node across the available functional units. We write t′(v) for the reweighted execution time of node v. For example, in Fig. 1, the only node with multiple copies is B. With only one adder, t′(B) = 2 as two copies of B will eventually have to be scheduled one after the other. If we had two adders, t′(B) = 1 because the copies could be scheduled to execute simultaneously on the different functional units.
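In our example from Fig. 1 using only one functional unit of each type, the target value derived from the iteration bound would be 3. The only node needing an adder is B (with q_B = 2), so the lower bound from the adder is 2 · 1/1 = 2. Nodes A and C both need the multiplier (with q_A = q_C = 1 and an operation time of 2), so the lower bound from the multiplier is 2 · 2/1 = 4. We therefore adopt 4 as our target clock period.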
Predecessors and Descendants
Because we will need the information when we sort out resource conflicts later, we count the number of descendants for each node along zero-delay edges. Begin by removing edges with positive delay counts from the graph. The resultant structure must be acyclic because SDFGs containing zero-delay cycles cannot be scheduled as noted above. We can therefore topologically sort the vertices in this directed acyclic graph (DAG) via depth-first search, as shown in [14] . This algorithm can be modified to count descendants without adding extra complexity by passing partial descendant counts from finalized children to their parents, who total these figures to arrive at their total descendant counts. In the case of our example, we have only one non-zero descendant count, namely 1 for A.
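A minimal sketch of the descendant count described above follows. It drops edges with positive delay and then lets each parent total the counts reported by its children, as the text describes; the names are hypothetical. One caveat of this summing scheme is that a node reachable along several zero-delay paths is counted once per path, so the result is exact only when the zero-delay subgraph is a forest.

```python
def descendant_counts(nodes, delays):
    """Count descendants along zero-delay edges.

    delays maps each edge (u, v) to its initial token count d(e).
    The zero-delay subgraph is ASSUMED to be acyclic, as required
    for the SDFG to be schedulable.
    """
    children = {v: [] for v in nodes}
    for (u, v), d in delays.items():
        if d == 0:  # edges carrying delays are removed
            children[u].append(v)

    counts = {}

    def visit(u):
        # Depth-first search; a node's count is final once all of
        # its children have been finalized.
        if u not in counts:
            counts[u] = sum(1 + visit(c) for c in children[u])
        return counts[u]

    for v in nodes:
        visit(v)
    return counts

# Fig. 1: (A, B) has no delay, (B, C) has two, (C, A) has one.
fig1_counts = descendant_counts(
    ["A", "B", "C"], {("A", "B"): 0, ("B", "C"): 2, ("C", "A"): 1}
)
```

For Fig. 1 this leaves A as the only node with a non-zero count, in agreement with the text.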
Constructing a Scheduling Graph
As in [2], we now construct a scheduling graph G_s = ⟨V ∪ {v_0}, E, ω, τ, p, k⟩, a model that encapsulates all information we will ultimately need to create our schedule. For the resource-constrained problem, this requires two steps, as seen in Algorithm 2. First, the edges are reweighted to preserve precedence relations among the nodes. Second, in order to uncover negative-weight cycles (indicating an infeasible clock period [2]), we insert a dummy node v_0 as well as directed edges which are assigned weights of zero from this new node to every other node in G. As an illustration, the scheduling graph for Fig. 1 with clock period 4 is shown in Fig. 2.
SDFG Scheduling
After this, the Bellman-Ford single-source shortest-path algorithm [14] is applied to the scheduling graph. If the algorithm finds a cycle having total weight less than zero in the graph, the given clock period is infeasible, and the algorithm would fail. Fortunately the adjustment we made earlier to the node computation times will prevent this from happening. We can therefore derive the shortest path weights from v 0 to every other node in the scheduling graph. This information is used by SDFG scheduling (Algorithm 3) to derive the starting times for all nodes in the first schedule iteration. These can then be repeated every c steps to generate the entire schedule. As in [2] , this method defines a legal, repeating, static schedule under either of the design styles described above.
Returning to the example of Fig. 1, we examine the scheduling graph and see that s(A) = s(C) = 0 and s(B) = −1/2, giving start times of 0, 2 and 0 for A, B and C, respectively. This schedule is shown in Fig. 3(a). As can be seen, the two copies of B appear back-to-back on the lone adder. However, this schedule overlaps the executions of the two multiply operations, violating the resource constraint. We require additional work to compensate for this.
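The shortest-path step can be sketched as follows. Algorithms 2 and 3 are not reproduced here, so the edge weight used below, the adjusted delay of the edge minus t′(u)/c, and the conversion of a shortest-path weight sh(v) into the start time −c · sh(v) are assumptions inferred from the numbers in the worked example. The per-edge adjusted delays (0, 1 and 1 around the loop of Fig. 1) and the function name are likewise assumed.

```python
from fractions import Fraction

def start_times(nodes, adjusted_delay, exec_time, c):
    """Bellman-Ford from a dummy source v0 joined to every node by a
    zero-weight edge. Returns {node: start time}, or None when a
    negative-weight cycle shows that clock period c is infeasible.

    adjusted_delay maps edge (u, v) to its adjusted delay count;
    exec_time holds the reweighted execution times t'(v).
    ASSUMED weight: w(u, v) = adjusted_delay - t'(u) / c.
    """
    weights = {
        (u, v): Fraction(d) - Fraction(exec_time[u], c)
        for (u, v), d in adjusted_delay.items()
    }
    # Distance 0 everywhere is the effect of the edges leaving v0.
    dist = {v: Fraction(0) for v in nodes}
    for _ in range(len(nodes)):
        for (u, v), w in weights.items():
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # Any further improvement means a negative-weight cycle exists.
    if any(dist[u] + w < dist[v] for (u, v), w in weights.items()):
        return None
    return {v: -c * dist[v] for v in nodes}

FIG1_DELAYS = {("A", "B"): 0, ("B", "C"): 1, ("C", "A"): 1}
FIG1_TIMES = {"A": 2, "B": 2, "C": 2}  # t'(B) = 2 after reweighting
first_pass = start_times(["A", "B", "C"], FIG1_DELAYS, FIG1_TIMES, 4)
```

Under these assumptions the sketch reproduces the start times of 0, 2 and 0 for A, B and C at clock period 4, and it reports a clock period of 2 as infeasible for the same graph.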
Algorithm 1. Creating a resource-constrained static schedule
Input: An SDFG G = ⟨V, E, d, t, p, k⟩ with BRV q, a specified system and model 
Correcting for Resource Constraints
Due to the complexity of processor assignment problems in general, adjusting for resource conflicts becomes the most complicated phase of this algorithm. Indeed, the most significant contribution of this work beyond what appeared in [7] is the improvement to this part of the process. We begin adjusting for resource conflicts by clearing all previous resource assignments and making all resource units available. These are then inserted into search trees corresponding to resource types, keyed by earliest availability (which is initially −1 for all instances). We then sort the vertices, first by ascending start time, then by decreasing number of descendants in the SDFG's directed acyclic counterpart, as is done in list scheduling [5], [13]. In our example, this sorted list is {A, C, B} based on the respective keys of (0, −1), (0, 0) and (2, 0).
this occurrence, we do so. Otherwise, we continue looking for an acceptable alternate. If none can be found, we return to the first one discovered and add a zero-delay edge from the node currently scheduled on this unit to the one needing service. (While unnecessary for this process, for completeness, the production and consumption rates for an added edge (u, v) can be set to the least common multiple of q_u and q_v.) This leaves this new node unassigned to any system component, leading to further iterations of the entire process until all nodes are scheduled.
Returning to our example, node A is assigned to the lone multiplier, with the device's availability set to time step 2. Now, when we attempt to schedule C at step 0, we find no available resource. The one which will be free most quickly is currently working on A, and so we add a new zero-delay edge (A, C) to the working copy of G and repeat the entire process. Four is still a valid target clock period, so we proceed to construct the new scheduling graph shown in Fig. 4. This time, the calculated shortest paths are s(A) = 0 and s(B) = s(C) = −1/2, yielding starting times of 0, 2 and 2, offsetting the start of C long enough so that it can share the multiplier with A, as shown in Fig. 3(b).
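A single pass of the conflict-resolution step might look like the simplified sketch below. It is not Algorithm 1: the search for an acceptable alternate unit is reduced to picking the unit of the right type that becomes free earliest, and a linear scan stands in for the balanced search trees mentioned above. All names are hypothetical.

```python
def resolve_pass(start, exec_time, descendants, unit_type, units):
    """One pass over the nodes in list-scheduling order.

    start, exec_time, descendants and unit_type are dicts keyed by
    node; units maps each resource type to its number of instances.
    Returns (order, assignment, new_edges), where new_edges holds the
    zero-delay edges to add before rescheduling.
    """
    # Each unit is (earliest availability, node it last served).
    pool = {x: [(-1, None)] * n for x, n in units.items()}
    # Ascending start time, then decreasing descendant count.
    order = sorted(start, key=lambda v: (start[v], -descendants[v]))
    assignment, new_edges = {}, []
    for v in order:
        candidates = pool[unit_type[v]]
        i = min(range(len(candidates)), key=lambda j: candidates[j][0])
        free_at, holder = candidates[i]
        if free_at <= start[v]:
            assignment[v] = (unit_type[v], i)
            candidates[i] = (start[v] + exec_time[v], v)
        else:
            # No unit is free: v must wait for the current holder.
            new_edges.append((holder, v))
    return order, assignment, new_edges

TYPES = {"A": "mul", "B": "add", "C": "mul"}
TIMES = {"A": 2, "B": 2, "C": 2}  # reweighted execution times
order, assigned, added = resolve_pass(
    {"A": 0, "B": 2, "C": 0}, TIMES, {"A": 1, "B": 0, "C": 0},
    TYPES, {"add": 1, "mul": 1},
)
```

On the first-pass start times of Fig. 1, this produces the sorted list {A, C, B} and the single new edge (A, C). Repeating the pass with the revised start times of 0, 2 and 2 assigns every node without adding further edges.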
As said, this method of handling resource constraints is the single largest difference between this work and our initial treatment of the subject [7] . It is also the biggest improvement over older methods such as Hu's algorithm [5] , [15] and list scheduling [5] , [16] . Not only is our method applicable to general-time SDFGs but also it does not rely on the construction of a complete resource conflict graph. Rather, it builds only the needed edges into the original SDFG.
Timing Analysis
Entering the initial loop, the value of the iteration bound may be determined in O(|V||E| log |V|) time assuming upper bounds on the total delay count and total computation time [17], [18]. The resource lower bounds are calculable in O(|V|) time by linearly sorting all nodes by resource type and then performing the prescribed computations.
As described above, performing the topological sort on the DAG and calculating descendant counts is completed in O(|V| + |E|) time [14]. Both the construction of the scheduling graph [13] and the execution of the revised Bellman-Ford algorithm [14] will require O(|V||E|) time. The time complexity of SDFG scheduling (Algorithm 3) is O(κ|V|) where κ is the largest element of the BRV. While this figure is unknowable in general, we may reasonably assume that it is a small constant and that this is a linear-time algorithm.
Figure 5. A simplified spectrum analyser.
We then move on to resolving resource conflicts. If R is the set of available resources, clearing resource assignments and building search trees take O(|R|) time. By the nature of the problem, we may assume that |R| ≪ |V|, making this relatively inexpensive. Sorting the nodes then requires O(|V| log |V|) time [14]. Considering the inner loop, searching for resources will cost O(log |R|) if we use balanced search trees such as red-black trees [14] to store the resource instances. Failing to find a suitable device is a constant-time operation if we retain the first unit found; otherwise it costs us O(log |R|) to update the device's information and reinsert it into the proper tree [14]. As we do this for each vertex, this phase of Algorithm 1 is an O(|V|^2 log |V|) action. Lastly, the least desirable situation regarding the outer loop occurs if all operations are assigned to the same functional unit, so that O(|V|) iterations of this loop are required. We thus conclude that this method requires O(|V|^3 log |V|) time at worst, a dramatic improvement over the initial version of this technique from [7]. This estimate may improve to O(|V|^2 |E| log |V|) were we to apply a linear-time sorting algorithm (such as counting, radix or bucket sort [14]) to the vertices of a sparse graph during resource conflict resolution.
A Simplified Spectrum Analyser
As further evidence of the effectiveness of our method, consider the adaptation of the simplified spectrum analyser initially appearing in [19], repeated as Fig. 5(a) with vertex explanations in Fig. 5(b). The analyser possesses a basic repetition vector of q = [16 1 1 1 4 1]^T. Assume that nodes C and F require a different type (call it type 2) of functional unit than do the other vertices. We will attempt to optimally schedule it on a system with one type-2 resource instance and four type-1 units to be shared among all nodes except C and F. Due to space restrictions, most of the transitional steps in the scheduling algorithm will not be pictured. Rather, we will explain the applicable facts.
There are two loops present in Fig. 5(a). The path B → C → D → F → B yields a time-to-delay ratio of 3, while A → B → C → D → E → A has an adjusted ratio of 9/2. The resource lower bound for type-1 units is ⌈22/4⌉ = 6 and that for type-2 is 4. We thus accept 6 as our initial target clock period. Further analysis reveals two descendants each for A and F, and one each for B and D.
Figure 6. Final resource-constrained schedule for the spectrum analyser.
When constructing the corresponding scheduling graph, we find that
As we begin resource conflict resolution, our sorted task list is {A, F, D, E, B, C}. The four copies of A take up the four appropriate devices, while F occupies the lone type-2 instance. D and E can then not be assigned, requiring the insertion of the edges (A, D) and (A, E). Nodes B and C can then be scheduled, but another pass is still required.
During the second pass, the new cycles A → D → E → A and A → E → A have time-to-delay ratios of 6 and 5, respectively, so 6 is still our target clock period. The new edges (A, D) and (A, E) have weights of −2/3 in the scheduling graph, giving start times of 0, 4, 5, 4, 5 and 0, respectively, to the nodes in our revised schedule. All descendant counts except A's (which increases to 4) remain the same, giving a sorted list of {A, F, B, D, C, E} this time. As shown in Fig. 6, all node instances can now be scheduled without further incident.
One weakness of this method illustrated by this example is its inability to extend beyond one iteration. If we attempt to schedule the next occurrence of node F, we will have conflict with the first occurrence of C. We can solve this with minor ad hoc modifications. In this example, if we add the one-delay edge (C, F) to our graph to represent this clash, the target clock period would remain 6, the length of the shortest path to F would become −1/6 and the start of F would be delayed one time step, a change which does not affect the rest of the repeating schedule and which solves the stated problem. Future work is needed to develop a more formal method for dealing with this situation.
Conclusion
The problem of efficiently and optimally scheduling tasks in a system represents a crucial challenge in both high-level VLSI synthesis and compiler design for parallel machines. In this paper, we have defined and demonstrated the first static scheduling method applicable to real-life systems with limited resources which are modelled by SDFGs, a paradigm commonly used to represent not only DSP applications but also the flow of data between vector processing units in modern supercomputers. This work thus represents a significant initial step towards solving an important problem. We have improved on our original work in [7] by making use of better data structures and refined methods to make the process more efficient. This said, there is still much work to do. Our examples here and in [7] demonstrate the further enhancements to be obtained by considering multiple schedule iterations which overlap. This method as defined is not yet flexible enough to permit this superimposition. As we continue to develop the theory of SDFGs, we hope to be able to dramatically improve the effectiveness and efficiency of this technique.
