Abstract-Many common iterative or recursive DSP applications can be represented by synchronous data-flow graphs (SDFGs). A great deal of research has been done attempting to optimize such applications through retiming. However, despite its proven effectiveness in transforming single-rate data-flow graphs to equivalent DFGs with smaller clock periods, the use of retiming for attempting to reduce the execution time of synchronous DFGs has never been explored. In this paper, we do just this. We develop the basic definitions and results necessary for expressing and studying SDFGs. We review the problems faced when attempting to retime a SDFG in order to minimize clock period, then present algorithms for doing this. Finally, we demonstrate the effectiveness of our methods on several examples.
I. INTRODUCTION
Since the most time-critical parts of DSP applications are loops, we must explore the parallelism embedded in the repetitive pattern of a loop. One of the most useful models for representing DSP applications has proven to be the multirate or synchronous data-flow graph (SDFG) first proposed by Lee [15]. The nodes of a SDFG represent functional elements, while edges between nodes represent connections between them. Each node consumes and produces a predetermined fixed number of delays (i.e., data tokens) on each invocation. Additionally, each edge may contain some initial number of delays. This model has proven popular with designers of signal processing programming environments [11, 13, 21, 27], with its use leading to numerous important results regarding the scheduling [9], hierarchization [24], vectorization [23] and multiprocessor allocation [10, 15] of DSP programs.
A great deal of research has been done attempting to optimize various aspects of an application's execution by applying various graph transformation techniques to the application's SDFG. One of the more effective of these techniques is retiming [17, 18], where delays are redistributed among the edges so that hardware is optimized while the application's function remains unchanged. Retiming was initially applied to single-rate DFGs to optimize the application's schedule of tasks: the clock period of the graph (i.e., the total computation time of the longest zero-delay path) was decreased so that the application could be scheduled more efficiently for execution on multiprocessors [4], [5], [6]. It was later extended to the more general SDFG model in order to extend vectorization capabilities [29] or minimize the total delay count of a SDFG [28]. However, the problem of using retiming to minimize the clock period of a multirate DFG has remained unexplored. In this paper, we will discuss this problem and propose a method for accomplishing this task.
The benefits of retiming single-rate data-flow graphs are widely reported in the literature. (The interested reader can find several cited in [19].) However, reworking our retiming methods so that they may be applied to multirate DFGs is not easy. The difference between the single-rate and multi-rate models lies in the specification of production and consumption rates on each edge; in single-rate graphs all such rates are assumed to be the same, whereas different rates for different edges are typically specified when constructing SDFGs. Two pitfalls were noted in [30]. First of all, a retiming may be derived for a single-rate DFG by solving a linear programming problem [18]. The introduction of rates on the edges potentially changes this to a more complicated integer linear programming problem. We later show that this particular ILP system exhibits special properties which permit an efficient solution. Second, the introduction of rates invalidates the traditional results regarding the delay counts of paths and cycles, depriving us of many useful results derived for the single-rate case. Specifically, in the single-rate case, we seek to remove zero-delay paths with excessive total computation times. It isn't clear what we want to avoid in the multi-rate case; a specific delay count on one path may or may not be adequate, depending on what rates have been specified.
This work is partially supported by NSF grants MIP95-01006 and MIP97-04276, and by the A. J. Schmitt Foundation.
The most popular method to date for retiming SDFGs was to avoid the problem entirely by translating the SDFG to its single-rate equivalent and retiming this new graph [12]. The possibility of then translating this new graph back to an equivalent retimed SDFG was mentioned in [28]. Unfortunately, as we will demonstrate, it may be impossible to translate a retimed single-rate graph back to a retimed SDFG. The original idea is flawed as well, in that performing retiming only once replaces an SDFG with a dramatically larger single-rate graph, complicating future stages in the design phase. Clearly it would be preferable to deal with the smaller SDFG as much as possible during this future work once retiming is completed, and while the retiming algorithms we will propose do rely heavily on the much larger single-rate equivalent at intermediate phases, the final product is still a SDFG.
In this paper, we will develop the basic definitions and results necessary for specifying and manipulating a SDFG and its single-rate equivalent. We will review retiming and point out the problems which arise when it is applied to SDFGs. We will propose polynomial-time algorithms (in the size of the SDFG's single-rate equivalent) which retime a given SDFG to have a specified clock period. Finally, we will demonstrate the effectiveness of our algorithms by applying them to several examples, in all cases achieving a provably minimal clock period.
In the next section, we will formalize the fundamental concepts related to the study of synchronous data-flow graphs. We then discuss retiming and the problems we face as we apply it to SDFGs. Next are our retiming algorithms, followed by detailed examples. Finally, we summarize our work and point to future directions for study.
II. SYNCHRONOUS DATA-FLOW GRAPHS
The concept of a synchronous data-flow graph was developed and used extensively by Lee and Messerschmitt [14], [15], [16] and later by Zivojnovic et al. [25, 28, 30]. In this section, we review their definitions and ideas in order to formalize these concepts.
A. Basic Definitions
A synchronous data-flow graph (SDFG) (sometimes called a multirate or regular data-flow graph) is a finite, directed, weighted graph whose components are:
1. the vertex set of nodes or actors, which transform input data streams into output streams;
2. the edge set, representing channels which carry data streams;
3. weights giving the initial delay count and the production and consumption rates of each edge, and the computation time of each node.
If all production and consumption rates equal one, we say that the graph is a homogeneous data-flow graph (HDFG). HDFGs are also sometimes referred to as single-rate data-flow graphs or simply data-flow graphs.
To illustrate, consider the SDFG given in Figure 1 (a) below. The numbers above the nodes represent the execution times for the individual tasks, while the smaller numbers at either end of an edge denote tokens produced or consumed. As an example, . All other entries are zero. As an example, the topology matrix of Figure 1 (a) is given in Figure 1(b) .
In [15] it was demonstrated that a repeating sequential schedule can be constructed for a SDFG if the rank of the graph's topology matrix is one less than the number of nodes in the SDFG. (The reverse is not necessarily true, as we will see shortly.) If this condition holds there is a positive integer vector in the nullspace of the topology matrix. The vector with the smallest norm from this nullspace is called the repetitions vector (RV) (or basic repetition vector in [2] ) for . For example, the RV for the SDFG in Figure 1 
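As an illustration of how a repetitions vector can be obtained, the following is a minimal sketch, not the procedure of [15]. It assumes a connected SDFG given as a list of `(source, sink, produced, consumed)` tuples and solves the balance condition that, over one iteration, each edge carries as many tokens out of its source as into its sink. The function name and the edge representation are hypothetical choices made for this example only.

```python
from fractions import Fraction
from math import gcd


def repetitions_vector(nodes, edges):
    """Return the smallest positive integer firing counts, or None if inconsistent.

    edges: list of (src, dst, produced, consumed); the graph is assumed connected.
    """
    adj = {n: [] for n in nodes}
    for u, v, p, c in edges:
        adj[u].append((v, Fraction(p, c)))  # balance: q[v] = q[u] * p / c
        adj[v].append((u, Fraction(c, p)))  # balance: q[u] = q[v] * c / p

    ratio = {nodes[0]: Fraction(1)}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v, factor in adj[u]:
            wanted = ratio[u] * factor
            if v not in ratio:
                ratio[v] = wanted
                stack.append(v)
            elif ratio[v] != wanted:
                return None  # the rates around some cycle do not balance

    # Scale the rational solution to the smallest positive integer vector.
    scale = 1
    for r in ratio.values():
        scale = scale * r.denominator // gcd(scale, r.denominator)
    counts = {n: int(r * scale) for n, r in ratio.items()}
    common = 0
    for x in counts.values():
        common = gcd(common, x)
    return {n: x // common for n, x in counts.items()}


# Hypothetical three-node cycle (not one of the figures in this paper).
example = [("A", "B", 2, 3), ("B", "C", 1, 2), ("C", "A", 3, 1)]
rv = repetitions_vector(["A", "B", "C"], example)
```

A `None` result corresponds to an inconsistent graph, for which no repeating schedule exists.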
B. Constructing an Equivalent HDFG
In order to study an SDFG, it is sometimes useful to create its equivalent homogeneous data-flow graph (EHG). As the name implies, an EHG performs the same function as the original SDFG, but is constructed so that each edge carries at most one token. Since each node is expecting to either produce or consume more data than this, an EHG compensates by inserting multiple edges between nodes. Algorithms for creating EHGs appear in [2] and [26] . In general, they first create enough copies of each node to satisfy the specifications of the RV. They then insert edges. If nodes in a SDFG are connected by a zero-delay edge, then the first data token produced by the first copy of the source must be consumed by the first copy of the sink in the EHG. If there are delays on an edge, the data contained here is consumed first, so that the first new token produced is in fact needed by a later copy of the sink. Such algorithms determine which copies of source and sink to map to one another based on how much data has been created and used. As an example, the EHG of Figure 1 (a) appears in Figure 3 (a). Note that, for purposes of clarity, we do not combine edges between nodes as is typically done. If multiple tokens are to be sent between nodes in the EHG, each travels along its own edge.
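The construction just described can be sketched as follows. This is a minimal illustration under stated assumptions rather than the algorithm of [2] or [26]: edges are `(source, sink, produced, consumed, delays)` tuples, `q` is the repetitions vector, and tokens on an edge are consumed in the order in which they were produced, with initial delays consumed first. The name `build_ehg` is hypothetical. As in the text, every token travels along its own edge.

```python
def build_ehg(edges, q):
    """Expand an SDFG into single-token edges between node copies.

    Returns a list of ((src, i), (dst, j), delay) triples, where i and j index
    the copies of the source and sink within one iteration.
    """
    ehg = []
    for u, v, p, c, d in edges:
        per_iteration = p * q[u]
        # A consistent graph moves the same number of tokens in and out.
        assert per_iteration == c * q[v], "rates and repetitions vector disagree"
        for k in range(per_iteration):
            src_copy = k // p                  # copy of u producing the k-th token
            position = k + d                   # initial delays are consumed first
            dst_copy = (position % per_iteration) // c
            delay = position // per_iteration  # iterations until consumption
            ehg.append(((u, src_copy), (v, dst_copy), delay))
    return ehg


# Hypothetical edge: A fires 3 times producing 2 tokens, B fires twice consuming 3.
q = {"A": 3, "B": 2}
no_delay = build_ehg([("A", "B", 2, 3, 0)], q)
one_delay = build_ehg([("A", "B", 2, 3, 1)], q)
```

With one initial delay, the last token produced in an iteration is carried over to the first copy of the sink in the following iteration, which appears as an EHG edge with delay count one.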
Finally, as derived in [2] and [7] , we will say that a SDFG is live if its EHG has no zero-delay cycles. Otherwise the graph is deadlocked. An example of a consistent deadlocked graph appears as Figure 3 (b), with its EHG in Figure 3 (c). As we can see, the loop between nodes e and i f y contains no delays, and so it is impossible to schedule them since each must precede the other. It should be clear that a SDFG must be both live and consistent in order for it to have a repeating static schedule.
C. The Delay Count of a Path
In [28] , Zivojnovic et al briefly discussed computing the cumulative delay count of a path in a SDFG, the sum of the delay counts of all copies of the path in the EHG. They omitted many details which would have aided understanding of their ideas, and then added to the confusion with typographical errors. Because of the necessity of these results for complete understanding of the SDFG model, we will now review, clarify and expand their line of reasoning.
A path in a SDFG is an ordered sequence of nodes and edges. Given a node , we will define the rate gain of
. This figure provides some measure of a node's production. If the rate gain is larger than R , the node is producing more than it is consuming; otherwise we are taking more than we are giving. For example, in the path 
. Then the cumulative delay count of
The idea is to push all delays down the path and onto the final edge 
Continuing in this fashion, we eventually arrive at
The chief issue with this result is the consistency of the delay count for a cycle. As an example, consider the cycle Figure 1 (a). When the nodes are visited in this order, the cycle contains ¥ total delays. However, when visited in the order r ) 1 ¢ e 1 ¦ i § 1 r , the delay count becomes T . There is still some gap in our understanding which requires further investigation.
III. RETIMING
A great deal of research has been done attempting to optimize the schedule of an application's tasks after applying various graph transformation techniques to the application's HDFG. One of the more effective of these techniques is retiming [17, 18] , where delays are redistributed among the edges so that the application's function remains the same while the execution time decreases. Despite its usefulness when applied to HDFGs, the application of retiming to SDFGs was explored only marginally prior to 1994 [12, 20] before being studied by Zivojnovic et al primarily as a way to minimize the delay count of a SDFG [28, 30] . In this section we intend to review the basics of retiming, explore some of the pitfalls which arise when studying retiming of SDFGs, demonstrate the effectiveness of retiming, and propose two algorithms for retiming SDFGs.
A. Basic Definitions
As we've said, a path in either a SDFG or a HDFG is any sequence of nodes and edges. The clock period © @ s t D of a HDFG is then defined to be the length of the longest zero-delay path [3] . This definition is problematic if we attempt to apply it directly to SDFGs, as we can see if we do so to Figure 3 . Thus, we are forced to define the clock period of a SDFG to be equal to the clock period of its EHG. As an example, the clock period of the SDFG in Figure 1 (a) is ¥ by this definition. Similar problems arise when we attempt to minimize the clock period. We will say that an iteration of a SDFG is the execution of all nodes of its EHG once. The average computation time of one iteration is then called the iteration period of the SDFG and is equal to the iteration period of the EHG. (In Figure 1 (a) the iteration period is also ¥ .) If a SDFG contains a loop, then the iteration period is bounded from below by the iteration bound [22] , which is defined to be the maximum time-to-delay ratio of all cycles in the EHG. For example, the EHG in Figure 3 T . This can be clearly seen from the schedule in Figure  2 (a), where overlapped iterations create higher throughput. (The iteration period of an SDFG can be overestimated using the ideas from [25] without constructing the EHG, but our method yields a tighter bound, which is important as we attempt to minimize the iteration period of an SDFG next.)
A retiming
is a function which specifies a transformation of a graph . It labels each vertex with a factor by which production and consumption rates are multiplied when computing the delay counts of the edges in the transformed graph. The effect is to change into the retimed graph
for all edges B ² b ³ ©
. As an example, a legal retiming with
transforms the SDFG of Figure 1 (a) into that of Figure 4 (a). Examining the EHG in Figure 4 (b), we see that we have now achieved an optimal [28] , the retimed delay count of a path
B. Problems Retiming EHGs
On first glance, it appears that we should just be able to retime the EHG via traditional methods and then map back to the original SDFG, as was proposed by Zivojnovic [28] . Unfortunately, the initial translation from SDFG to EHG is too complex to permit this. As an example, consider the unit-time SDFG given in Figure 5 (a), with its EHG appearing in Figure 5 should have non-zero delay counts, which also contradicts what we have. In any case, there can be no direct matching in this case. If we are to retime SDFGs, we must work directly on the original graph itself.
IV. RETIMING A SDFG
Since we cannot retime a SDFG by working with its EHG, we must develop methods for retiming the SDFG directly. In this section we refine the methods of [18] to deal with this situation.
A. Initial Problems
Unfortunately, the retiming algorithms we will propose will either be pessimistic or expensive. The reason for this is that the original methods we are using as a basis were themselves built on one result from [18] : 
R V
The problem now is that insufficient delays along a path in a SDFG do not necessarily translate into a zero-delay path in the EHG. As an example, consider the unit-time SDFG in Figure 6 (a) below, with its EHG given in Figure 6 (b). For Ì ¡ T , examining only the original Another additional cost that the problem of insufficient delays forces us to pay comes in the form of additional checks for legality. In the original algorithms from [18] , only one delay at a time was moved, a stipulation which did not cause the proposed retiming to become illegal at any intermediate step (as proven in [18] ). Because we are now pulling groups of delays through nodes, this situation no longer exists, and so we will have to check for legality at every stage of an algorithm.
The question now is to determine exactly how many delays to view as sufficient. Let edges for these data. We will use either of these figures as the number of tokens required by an edge in the SDFG during retiming.
B. First Method: Linear Programming
We begin with a mathematical solution based entirely on the following idea. 
, where 
for all B , both results derived from definitions. Simple algebraic manipulations now yield the desired result.
The remaining criteria ensure that there is no zero-delay path in the EHG with computation time too large. If
then the path from¯to F contains a subpath which must have positive delay count, so it suffices to consider only the cases we've specified. The conditions for (2) dictate that the path from Õ to F has small enough computation time; we only have a problem when we add an initial edge from¯to Õ to the path. Therefore, if we make sure that the retimed delay count for this initial edge is large enough for every copy of the edge in the EHG to have a non-zero delay count, each copy of the path in the EHG will have non-zero delay count and the conditions of Theorem IV.1 will be satisfied. Since there are Ð P @ C B 7 D copies of this edge B in the EHG, we want «@ C B 7 D to exceed this figure, leading to the stated criterion. Condition (3) is derived in the same manner, dealing with the final edge in the path rather than the first edge. This result leads to an algorithm which may perform wasteful operations; it may be possible that the delay counts of the other edges in the path are sufficient so that retiming is unnecessary, but we will retime anyway. Because of this result, we can construct a system of linear inequalities which can be solved in polynomial time by the BellmanFord Algorithm [8] . Furthermore, because we are working with values derived from the EHG, we will avoid the false path problem.
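To make the final step concrete, the following minimal sketch shows how a system of difference constraints of the form x_j - x_i <= w can be solved by Bellman-Ford relaxation on its constraint graph, in the manner described in [8]. It is a generic solver and does not build the specific inequalities of our theorem; the function name and the `(i, j, w)` encoding are assumptions made for this example.

```python
def solve_difference_constraints(variables, constraints):
    """Solve x[j] - x[i] <= w for every (i, j, w); return a dict or None.

    Starting all distances at 0 plays the role of a virtual source joined to
    every variable by a zero-weight edge.
    """
    dist = {v: 0 for v in variables}
    for _ in range(len(variables)):
        changed = False
        for i, j, w in constraints:
            if dist[i] + w < dist[j]:
                dist[j] = dist[i] + w
                changed = True
        if not changed:
            return dist  # shortest-path lengths satisfy every constraint
    return None  # still relaxing: the constraint graph has a negative cycle


# Hypothetical system: x1 - x0 <= 3 and x0 - x1 <= -2.
feasible = [(0, 1, 3), (1, 0, -2)]
solution = solve_difference_constraints([0, 1], feasible)
# Tightening the first bound to 1 creates a negative cycle.
infeasible = solve_difference_constraints([0, 1], [(0, 1, 1), (1, 0, -2)])
```

A `None` result indicates that the inequalities cannot all be satisfied, which in our setting would mean that the requested clock period cannot be reached by this method.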
Making use of this idea requires us to calculate Ô and , the maximum computation time and minimal delay counts along paths between nodes, respectively. Algorithm 1 below is based on the method of [18] , constructing a matrix ã and manipulating it via the Floyd-Warshall allpairs shortest-path algorithm [8] 
is a copy of¯and F is a copy of F in the EHG. We are forced to work with the EHG due to the problems we noted earlier regarding the calculation of a path's delay count.
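For orientation, the single-rate computation of [18] on which this step is based can be sketched as below. This is a minimal illustration for a HDFG with no zero-delay cycles, not Algorithm 1 itself, which works with copies of nodes in the EHG. Each edge is weighted by the pair (delay count, negated computation time of its source), and an all-pairs shortest-path pass with lexicographic comparison yields the minimal delay count between two nodes and the maximal computation time among the paths attaining it. The function name and data layout are hypothetical.

```python
def min_delay_max_time(nodes, t, edges):
    """Return (W, D): minimal path delay counts and matching maximal path times.

    nodes: list of node names; t: dict node -> computation time;
    edges: list of (src, dst, delay). Assumes no zero-delay cycles.
    """
    inf = float("inf")
    best = {(u, v): (inf, 0) for u in nodes for v in nodes}
    for u in nodes:
        best[(u, u)] = (0, 0)  # the empty path
    for u, v, d in edges:
        best[(u, v)] = min(best[(u, v)], (d, -t[u]))

    # Floyd-Warshall over pairs, compared lexicographically.
    for k in nodes:
        for u in nodes:
            for v in nodes:
                first, second = best[(u, k)], best[(k, v)]
                candidate = (first[0] + second[0], first[1] + second[1])
                if candidate < best[(u, v)]:
                    best[(u, v)] = candidate

    W = {uv: x for uv, (x, y) in best.items() if x != inf}
    # Add back the sink's own time, which the edge weights leave out.
    D = {(u, v): t[v] - y for (u, v), (x, y) in best.items() if x != inf}
    return W, D


# Hypothetical HDFG, not one of the figures in this paper.
times = {"A": 1, "B": 2, "C": 3}
graph = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2), ("A", "C", 1)]
W, D = min_delay_max_time(["A", "B", "C"], times, graph)
```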
As an example of the potential wastefulness, let us demonstrate our ideas on the path in Figure 6 (a) first with 0 ¡ ç T
. Recall that this example does not need to be retimed given these conditions. There are only two edges we are considering, so the first condition of Theorem IV.2 gives us two inequalities: 
Since the second system supersedes the first, it is sufficient to solve only the second and derive a solution with 
Given that the graph was already optimal, we see that our algorithm costs us a great deal in this case.
(Algorithm 1: computation of the maximum computation times and minimal delay counts along paths between nodes.)
For another interesting example, consider Figure 7(a) below with 0 ¡ ¥ . (Since all nodes in the graph take ¥ time units to execute, this is the smallest clock period we can hope to achieve.) Its EHG is given in Figure 7(b). The Ô and values derived from this graph are given in [19]; suffice it to say that, for each pair of vertices ¯ and 
Condition 1 also gives us Q inequalities to satisfy based on edges in the original SDFG. However each of these inequalities is replaced by one of the tougher restrictions we have just seen, so it suffices to consider only the constraints derived above.
Before we can proceed, we must multiply each of our equations by properly chosen constants so that the coefficients of each variable occurrence match. Completing this exercise leaves us with the system
As in [8] , we can now model our set of inequalities by the constraint graph of Figure 8 to all other nodes to get an answer of
. Since we prefer positive retimings, we normalize this answer by adding T¥ to all values before dividing to produce our final answer of
. (The sequence of events at this step is crucial; the reader can verify that dividing and then normalizing yields an incorrect answer.) Applying this function to the SDFG of Figure 7 (a) yields the retimed graph of Figure 8 (b) whose EHG appears in Figure 8(c) . An examination of this EHG reveals that we have indeed found a retiming which achieves our desired clock period. We also see that, due to the different rates of production and consumption by each of the nodes, the delay counts in the cycles no longer appear to match.
C. Second Method: Relaxation
Alternately we can more efficiently seek our retiming via relaxation on the edges of our graph. We do this by topologically sorting our vertices (so that¯precedes F if there is a zero-delay edge
in the EHG [8] ) and then sweeping along the sorted list. When we get to a point where the current path is too long, we insert enough delays so as to break the path up into sufficiently small pieces. We then verify that we are allowed to do this. If we cannot then there is no retiming and we return with an error; otherwise we sweep further. Once our prospective retiming has been found, we test the retimed graph to make sure that the clock period is within our requirements. If it is we have found a way to retime the SDFG; otherwise there is no such retiming.
We begin our construction by considering Algorithm 2 below, the
-time algorithm from [18] for finding the length of the longest zero-delay path into each vertex of a HDFG. This procedure first sorts the vertices so that those occurring early in the list are connected to vertices later in the list by zero-delay edges. It then traces through the list, associating each vertex with the length of its longest zero-delay path. If a vertex is not connected to a previous one, its path length must equal its own computation time; otherwise its path length equals its own time, plus the sum of the times of all the other vertices found along the path to this point. We require this algorithm not just for constructing our retiming, but also for verifying that our final retimed graph executes within the required time frame. 
(Algorithm 2: longest zero-delay path lengths into each vertex of a HDFG, after [18].)
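The longest zero-delay path computation described above can be sketched as follows. This is a minimal illustration, assuming a HDFG given as `(source, sink, delay)` tuples and a dictionary of computation times; the function name is hypothetical and the listing is not a transcription of Algorithm 2. Vertices are processed in a topological order of the zero-delay edges, and the clock period is the largest value returned.

```python
from collections import deque


def longest_zero_delay_paths(nodes, t, edges):
    """Return dict node -> length of the longest zero-delay path ending there.

    Returns None if the zero-delay edges contain a cycle (a deadlocked graph).
    """
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for u, v, d in edges:
        if d == 0:
            successors[u].append(v)
            indegree[v] += 1

    # A vertex with no zero-delay predecessor starts at its own time.
    length = {n: t[n] for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in successors[u]:
            length[v] = max(length[v], length[u] + t[v])
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return length if visited == len(nodes) else None


# Hypothetical HDFG, not one of the figures in this paper.
times = {"A": 1, "B": 2, "C": 3}
graph = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2), ("A", "C", 1)]
lengths = longest_zero_delay_paths(["A", "B", "C"], times, graph)
clock_period = max(lengths.values())
```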
With this in hand we can now proceed to our primary method, given as Algorithm 3 below. We begin by retiming our SDFG with the result to date and constructing its EHG. The EHG is then handed to Algorithm 2 to find the lengths of the maximum paths to all vertices. At this point, if the longest path length is sufficiently small, we return our current retiming function as the final answer. Otherwise the vertices in our SDFG fall into one of two groups. If all copies of a vertex in the EHG are isolated (i.e., connected to the rest of the graph only by edges containing delays), we do not wish to retime the node and remove it from consideration. Otherwise a copy of the node¯lies along some zero-delay path in the EHG and we may have to retime it. In this case we assign it a longest path length
equal to the longest path length of any of its copies in the EHG.
We now consider nodes for further retiming. Since we want to push delays forward along our paths (rather than pulling them backward as was done in [18] ), we retime those nodes which occur early in a path. This process is complicated by the different rates of production and consumption on each node. For example, for each delay drawn into node e in Figure 7 (a), three delays are pushed onto the edge from e to i and two delays onto the edge from e to r . Therefore, for each outgoing edge from such a node, we calculate the number of delays needed to retime all copies of the edge in the EHG, subtract the number of delays currently on the edge, adjust for the different rates of production and consumption, and retime by the maximum of these needs. Once all nodes are retimed, we test the prospective retiming for legality, i.e. we check that retiming by our function doesn't result in some edge containing a negative number of delays. If we pass this test, we look further along our path for other nodes in need of retiming.
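The legality test mentioned above can be sketched as follows. This minimal example assumes the forward-pushing convention described in this section, under which retiming a node by one unit adds its production rate to each outgoing edge and removes its consumption rate from each incoming edge, so that an edge from u to v with production rate p and consumption rate c receives d + p r(u) - c r(v) delays. That formula, the tuple layout and the function names are assumptions made for illustration.

```python
def retime(edges, r):
    """Apply retiming r to edges given as (src, dst, produced, consumed, delays)."""
    return [(u, v, p, c, d + p * r[u] - c * r[v]) for u, v, p, c, d in edges]


def is_legal(edges, r):
    """A retiming is legal when no edge ends up with a negative delay count."""
    return all(d >= 0 for *_, d in retime(edges, r))


# Hypothetical two-node cycle, not one of the figures in this paper.
cycle = [("A", "B", 3, 2, 0), ("B", "A", 2, 3, 6)]
pushed_once = retime(cycle, {"A": 1, "B": 0})
```

Because groups of delays move together, a retiming that is legal for one value of r(u) can become illegal for a larger one, which is why the test is repeated at every stage.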
Once we have checked all nodes at least once and have derived a legal retiming function, it is time to test our answer. We repeat our earlier steps to find the lengths of the maximum zero-delay paths to each node one last time. Since the length of the largest zero-delay path in the EHG equals our clock period, this value is tested against our requested clock period. If it is still too large we cannot retime this SDFG to execute in the time we wish and must return with an error. Otherwise we have found our retiming.
We now demonstrate our method by executing it on the SDFG of Figure 7 (a) with ¡ ¥
. Sorting the vertices of Figure 7 (b), computing longest path lengths and taking the maxima reveals that 
. The new retimed graph is given below as Figure 10 (a), with its EHG in Figure 10( 
. The application of this retiming to the original SDFG results in the graph of Figure 8 (b) and we have found our answer.
However, we have a final pass of the algorithm to perform. We construct and study the EHG of Figure 8(c), find that the maximum zero-delay path is an individual node with computation time ¥ , conclude that we have found our retiming and return it as our answer before proceeding to the inner nested loops.
D. Discussion
Our first method is a straight-forward problem in linear programming but constitutes a very expensive solution. If 
T ¥
time to execute the Bellman-Ford algorithm to find the shortest path lengths in our constraint graph. There is no guarantee that this method will yield an answer, and even if it finds a solution, it does so at a great price.
On the other hand, our second procedure is more efficient than our first method and more intuitive. Since the construction of an EHG and Algorithm 2 each require
time. However, while we suspect that its success is both a necessary and sufficient condition for a SDFG to be retimable to a given clock period, it is unknown whether or not this is the case. In our defense, the algorithm from [18] upon which this method is based was also never proven both necessary and sufficient, but has been extremely useful in practice. We suspect that the algorithm we have described here will prove just as valuable despite this logical gap.
V. EXAMPLES
In this section, we illustrate our methods further by applying them to various SDFGs found in the literature. Additional examples may be found in [19] .
A. First Example
Let us begin with a slowed version of the example from [30] , given in Figure 11 (a). The RV is ¡ ¢ R G G ¥
. We will attempt to achieve a clock period of U , which equals the execution time of node e and hence is minimal. If we apply our relaxation algorithm, we complete execution in two passes. The first time through computes longest zero-delay path lengths of . This is a legal retiming, and when we begin the next pass of our loop, we find that it is adequate for retiming Figure 11 (a) to have clock period U , thus terminating execution of our algorithm with the retimed graph in Figure 11 
B. A Simplified Spectrum Analyzer
Finally, let us apply our algorithms to a variation of the simplified spectrum analyzer from [28] which appears in Figure 12 (a), with node descriptions in Figure 12(b) . This graph has a RV of ¡ R í R R R ¥ R
, so in the interests of space we will not display the EHG at each step. Instead, we shall describe the pertinent information. It can be shown that the lower bound on the clock period for this SDFG is U , and so we will attempt to retime it to be optimal. there are already sufficient delays on an edge, the value of the retiming for the edge's source node will not change.) Applying this retiming to the graph in Figure 12(a) yields the graph of Figure 13. In the next pass of the algorithm, we first check the clock period of this retimed graph, find that we have achieved an optimal retimed graph, and return the current value of the retiming as our final answer.
In this paper, we have established a notation for expressing and studying retimings of synchronous data-flow graphs. We have presented the difficulties involved with retiming SDFGs, and then constructed a polynomial-time algorithm (in the size of the SDFG's homogeneous equivalent) for retiming a synchronous graph so that it achieves a sufficiently small clock period. Finally, we have demonstrated the effectiveness of our algorithm on several examples. In all cases we have studied, we have been able to achieve minimal execution times, indicating the strength of our second algorithm.
Regardless of how good our algorithm may be, it is still not proven to represent both necessary and sufficient conditions for retiming. This proof, or the construction of an alternate method which is necessary and sufficient, remains an interesting open problem. A better grasp of the results regarding delay counts in [28] will definitely lead to greater understanding of our model and may open the door to removing this logical gap. It may also lead to a study of retiming applied to even more complicated models, such as cyclo-static [2] or dynamic DFGs.
