Many common iterative or recursive DSP applications can be represented by synchronous data-flow graphs (SDFGs). A great deal of research has been done attempting to optimize such applications through retiming. However, despite its proven effectiveness in transforming single-rate data-flow graphs to equivalent DFGs with smaller clock periods, the use of retiming for attempting to reduce the execution time of synchronous DFGs has never been explored. In this paper, we do just this. We develop the basic definitions and results necessary for expressing and studying SDFGs. We review the problems faced when attempting to retime a SDFG in order to minimize clock period, then present an algorithm for doing this. Finally, we demonstrate the effectiveness of our method on several examples.
Introduction
Since the most time-critical parts of DSP applications are loops, we must explore the parallelism embedded in the repetitive pattern of a loop. One of the most useful models for representing DSP applications has proven to be the multirate or synchronous data-flow graph (SDFG) first proposed by Lee [13]. The nodes of a SDFG represent functional elements, while edges between nodes represent connections between them. Each node consumes and produces a predetermined fixed number of delays (i.e., data tokens) on each invocation. Additionally, each edge may contain some initial number of delays. This model has proven popular with designers of signal processing programming environments [9, 11, 18, 23], with its use leading to numerous important results regarding the scheduling [7], hierarchization [21], vectorization [20] and multiprocessor allocation [8, 13] of DSP programs.
A great deal of research has been done attempting to optimize various aspects of an application's execution by applying various graph transformation techniques to the application's SDFG. One of the more effective of these techniques is retiming [15, 16], where delays are redistributed among the edges so that hardware is optimized while the application's function remains unchanged. Retiming was initially applied to single-rate DFGs to optimize the application's schedule of tasks so that the clock period of the graph (i.e., the total computation time of the longest zero-delay path) was decreased in order for the application to be more efficiently scheduled for execution on multiprocessors [3, 4, 5]. It was later extended to the more general SDFG model in order to extend vectorization capabilities [25] or minimize the total delay count of a SDFG [24]. However, the problem [...] to create the retimed graph in Figure 1(b) with clock period [...]. The function of the two graphs is the same, with the only complication being that we will have to provide the first value of £ when we begin execution. The costs of doing this are minuscule when we consider that we will be saving a clock cycle each time we execute the loop.
An additional benefit is that scheduling alone usually yields a schedule requiring more resources than the schedule produced by retiming first. To illustrate this, consider the single-rate data-flow graph in Figure 2(a). It is clear that this graph has a clock period of 4, and we can derive the schedule in Figure 2(b) which has this clock period. Note that this schedule requires a minimum of 5 processing units to execute because of the work called for at any time-step greater than zero which is a multiple of 4.
On the other hand, suppose that we retime our graph to become the DFG of Figure 3(a). This retimed graph permits us flexibility when scheduling node P, allowing us to compact our schedule and produce the one given in Figure 3(b), which requires only 3 processors, a 40% reduction in resources required for execution.
The benefits are clear, but reworking our retiming methods so that they may be applied to synchronous graphs is not easy. The difference between the single-rate and multi-rate models lies in the specification of production and consumption rates on each edge; in single-rate graphs all such rates are assumed to be the same, whereas different rates for different edges are typically specified when constructing SDFGs. Two pitfalls were noted in [26]. First of all, a retiming may be derived for a single-rate DFG by solving a linear programming problem [16]. The introduction of rates on the edges potentially changes this to a more complicated integer linear programming problem. Second, the introduction of rates invalidates the traditional results regarding the delay counts of paths and cycles, depriving us of many useful results derived for the single-rate case. Specifically, in the single-rate case, we seek to remove zero-delay paths with excessive total computation times. It isn't clear what we want to avoid in the multi-rate case; a specific delay count on one path may or may not be adequate, depending on what rates have been specified.
Finally, the most popular method for retiming SDFGs has been to translate the SDFG to its single-rate equivalent, retime this new graph, then translate back [10, 24] . There are two problems with this idea. First, as we will demonstrate, it may be impossible to translate a retimed single-rate graph back to a retimed SDFG. Second, even if this method works, the costs in performing the necessary translations and dramatically increasing our problem size may be prohibitive. It is clearly preferable to work with the original SDFG as much as possible.
In this paper, we will develop the basic definitions and results necessary for specifying and manipulating a SDFG and its single-rate equivalent. We will review retiming and point out the problems which arise when it is applied to SDFGs. We will propose a polynomial-time algorithm which retimes a given SDFG to have a specified clock period. Finally, we will demonstrate the effectiveness of our algorithm by applying it to several examples.
In the next section, we will formalize the fundamental concepts related to the study of synchronous dataflow graphs. We then discuss retiming and the problems we face as we apply it to SDFGs. Next is our retiming algorithm, followed by detailed examples. Finally, we summarize our work and point to future directions for study.
Synchronous Data-Flow Graphs
The concept of a synchronous data-flow graph was developed and used extensively by Lee and Messerschmitt [12, 13, 14], but was not rigorously defined until the work of Zivojnovic et al. [22, 24, 26]. In this section, we review their definitions and ideas in order to formalize these concepts.
Basic Definitions
A synchronous data-flow graph (SDFG) (sometimes called a multirate or regular data-flow graph) is a finite, directed, weighted graph [...] where:
1. [...] is the vertex set of nodes or actors, which transform input data streams into output streams;
2. [...] is the edge set, representing channels which carry data streams;
[...] is a function with [...] in the figure. Furthermore, the numbers at either end of the edge connecting [...]
In [13] it was demonstrated that a repeating sequential schedule can be constructed for a SDFG only if the rank of the graph's topology matrix is one less than the number of nodes in the SDFG. (The reverse is not necessarily true, as we will see shortly.) If this condition holds there is a positive integer vector [...] Figure 5(c). It is clear that, if we attempt to execute this circuit, each node will fire once before node £ deadlocks the system waiting for its second token.
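The rate condition above can be checked without forming the topology matrix explicitly. The following is a minimal sketch, not taken from [13], which assumes a connected graph given as a list of hypothetical (source, sink, produced, consumed) tuples. It propagates firing ratios along the edges, verifies that every edge is balanced, and scales the result to the smallest positive integer vector. The function name and input layout are our own assumptions.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(nodes, edges):
    """Return the smallest positive integer firing vector, or None.

    nodes: list of node names (graph assumed connected).
    edges: list of (src, dst, produced, consumed) tuples (assumed layout).
    """
    # Each edge forces q[dst] = q[src] * produced / consumed.
    adj = {n: [] for n in nodes}
    for u, v, p, c in edges:
        adj[u].append((v, Fraction(p, c)))
        adj[v].append((u, Fraction(c, p)))

    # Propagate rational firing rates from the first node.
    rate = {nodes[0]: Fraction(1)}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v, ratio in adj[u]:
            if v not in rate:
                rate[v] = rate[u] * ratio
                stack.append(v)

    # Every edge must balance: tokens produced equal tokens consumed.
    for u, v, p, c in edges:
        if rate[u] * p != rate[v] * c:
            return None  # inconsistent rates, no repeating schedule

    # Clear denominators, then divide out any common factor.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    q = {n: int(r * scale) for n, r in rate.items()}
    g = reduce(gcd, q.values())
    return {n: k // g for n, k in q.items()}
```

For a chain in which A produces 2 tokens per firing, B consumes 3 and produces 1, and C consumes 2, the sketch returns 3, 2 and 1 firings respectively. A return value of None signals inconsistent rates; a consistent graph may still deadlock, which this check does not detect.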
Constructing an Equivalent HDFG
In order to study an SDFG, it is sometimes useful to create its equivalent homogeneous data-flow graph (EHG). As the name implies, an EHG performs the same function as the original SDFG, but is constructed so that each edge carries at most one token. Since each node is expecting to either produce or consume more data than this, an EHG compensates by inserting multiple edges between nodes.
An algorithm for creating a graph's EHG appears as Algorithm 1 below. It is adapted from the method of [1] for constructing the EHG of cyclostatic DFGs, which not only permit multiple tokens to pass along edges but also specify the pattern of their production or consumption. The algorithm first creates enough copies of each node to satisfy the specifications of the BRV. It then inserts edges. If nodes in a SDFG are connected by a zero-delay edge, then the first data token produced by the first copy of the source must be consumed by the first copy of the sink in the EHG. If there are delays on an edge, the data contained here is consumed first, so that the first new token produced is in fact needed by a later copy of the sink. The algorithm determines which copies of source and sink to map to one another based on how much data has been created and used. As an example of our algorithm in action, the EHG of Figure 4(a) appears in Figure 6.
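The token-matching idea described above can be sketched as follows. This is a standard expansion written under our own assumptions, not a transcription of Algorithm 1: edges are hypothetical (source, sink, produced, consumed, delays) tuples, the BRV is passed in as a dictionary, and one EHG edge is emitted per token so that tokens are never combined.

```python
def build_ehg(brv, edges):
    """Expand an SDFG into one homogeneous edge per token.

    brv:   dict mapping node -> number of copies (firings per iteration).
    edges: list of (src, dst, produced, consumed, delays) tuples (assumed layout).
    Returns a list of ((src, i), (dst, j), delay) triples.
    """
    ehg = []
    for u, v, p, c, d in edges:
        # k indexes the tokens produced on this edge during one iteration.
        for k in range(p * brv[u]):
            src_copy = k // p            # which copy of u produces token k
            firing = (k + d) // c        # initial delays are consumed first
            dst_copy = firing % brv[v]   # which copy of v consumes it
            delay = firing // brv[v]     # iterations the token must wait
            ehg.append(((u, src_copy), (v, dst_copy), delay))
    return ehg
```

With a zero-delay edge, token 0 of copy 0 of the source maps to copy 0 of the sink, as the text requires. In this sketch the delays on the generated edges sum to the delay count of the original edge, so the totals of the two graphs agree.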
There are two significant differences between our algorithm and that of [1]. First, the original algorithm was more concerned with making sure that the amount of data produced and consumed on an edge matched. This yields a simpler but more confusing graph. For purposes of clarity, we do not combine edges between nodes. If multiple tokens are to be sent between nodes in the EHG, each travels along its own edge. One benefit is that the delay counts between the original SDFG and the EHG match in our model. More significantly, the original algorithm also inserted control dependencies into the EHG, ensuring that all copies of a node execute serially. Since we are concerned with maximizing parallelism, we concern ourselves only with the necessary data dependencies.
Finally, as derived in [1] and [6], we will say that a SDFG is live if its EHG has no zero-weight cycles. Otherwise the graph is deadlocked. An example of a consistent deadlocked graph appears as Figure 7 [...] contains no delays, and so it is impossible to schedule them since each must precede the other. It should be clear that a SDFG must be both live and consistent in order for it to have a repeating static schedule.
Retiming
A great deal of research has been done attempting to optimize the schedule of an application's tasks after applying various graph transformation techniques to the application's HDFG. One of the more effective of these techniques is retiming [15, 16], where delays are redistributed among the edges so that the application's function remains the same while the execution time decreases. Despite its usefulness when applied to HDFGs, the application of retiming to SDFGs was explored only marginally prior to 1994 [10, 17] before being studied by Zivojnovic et al. primarily as a way to minimize the delay count of a SDFG [24, 26]. In this section we intend to review the basics of retiming, explore some of the pitfalls which arise when studying retiming of SDFGs, demonstrate the effectiveness of retiming, and propose two algorithms for retiming SDFGs.
Basic Definitions
A path in either a SDFG or a HDFG is any sequence of nodes and edges. The clock period [...] of a HDFG is then defined to be the length of the longest zero-delay path [2]. This definition is problematic on two counts. First, it is not clear what the delay count of a path in a SDFG really is in light of the inconsistencies in the results from [24]. Second, suppose that we attempt to apply our definition directly to SDFGs, as demonstrated by Figure 7(a). We would conclude that the clock period equals [...], but in reality the graph must have an infinite clock period because of the problems scheduling nodes [...]. Thus, we are forced to define the clock period of a SDFG to be equal to the clock period of its EHG. As an example, the clock period of the SDFG in Figure 4 [...]. This can be clearly seen from the schedule in Figure 5(a), where overlapped iterations create higher throughput. (The iteration period of an SDFG can be overestimated using the ideas from [22] without constructing the EHG, but our method yields a tighter bound, which is important as we attempt to minimize the iteration period of an SDFG next.)
Problems Retiming EHGs
At first glance, it appears that we should just be able to retime the EHG via traditional methods and then map back to the original SDFG, as was done by Lee originally [10]. Unfortunately, the initial translation from SDFG to EHG is too complex to permit this. As an example, consider the unit-time SDFG given in Figure 9(a), with its EHG appearing in Figure 9 [...]
Retiming a SDFG
Since we cannot retime a SDFG by working with its EHG, we must develop methods for retiming the SDFG directly. In this section we refine the methods of [16] to deal with this situation.
Initial Problems
Unfortunately, the retiming algorithms we will propose will either be pessimistic or expensive. The reason for this is that the original methods we are using as a basis were themselves built on one result from [16]: [...] The problem now is that insufficient delays along a path in a SDFG do not necessarily translate into a zero-delay path in the EHG. As an example, consider the unit-time SDFG in Figure 10(a) below, with its EHG given in Figure 10(b). For [...], examining only the original SDFG would lead us to retime this path even though such an exercise is unnecessary. To avoid such false paths, we may need to construct intermediate EHGs for study, a very costly process.
In a similar vein, the nature of an EHG raises the question of what a path actually is. The traditional definition says that a path is a sequence of nodes and edges. Since we now have multiple edges between nodes, we must be very careful to consider all paths resulting from such multiple copies. To illustrate, the traditional definition would dictate that there is one path from [...] in the EHG when we do our calculations below. While this makes sense, it is somewhat different from what has always been done and so must be noted.
A further cost that the problem of insufficient delays forces us to pay comes in the form of additional checks for legality. In the original algorithms from [16], only one delay at a time was moved, a stipulation which did not cause the proposed retiming to become illegal at any intermediate step (as proven in [16]). Because we are now pulling groups of delays through nodes, this guarantee no longer holds, and so we will have to check for legality at every stage of an algorithm.
The question now is to determine exactly how many delays to view as sufficient. Let [...] be an edge in a SDFG. Each copy of [...]
Retiming Algorithm
We will seek our retiming via relaxation on the edges of our graph. We do this by sorting our vertices and then sweeping along the sorted list. When we get to a point where the current path is too long, we insert enough delays so as to break the path up into sufficiently small pieces. We then verify that we are allowed to do this. If we can't then there is no retiming and we return with an error; otherwise we sweep further. Once our prospective retiming has been found, we test the retimed graph to make sure that the clock period is within our requirements. If it is we've found a way to retime the SDFG; otherwise there is no such retiming.
We begin our construction by considering Algorithm 2 below, the O(|E|)-time algorithm from [16] for finding the length of the longest zero-delay path into each vertex of a HDFG. This procedure first sorts the vertices so that those occurring early in the list are connected to vertices later in the list by zero-delay edges. It then traces through the list assigning each vertex the length of its longest zero-delay path. If a vertex has no zero-delay edge from a previous vertex, its path length must equal its own computation time; otherwise its path length equals its own time plus the largest path length among the vertices that reach it by a zero-delay edge. We require this algorithm not just for constructing our retiming, but also for verifying that our final retimed graph executes within the required time frame.
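The sort-and-sweep procedure just described can be sketched as follows. This is an illustration in the spirit of the longest zero-delay path computation, not a transcription of Algorithm 2; the function names, the (source, sink, delay) edge layout, and the choice to return None when the zero-delay edges form a cycle are our own assumptions.

```python
from collections import deque


def zero_delay_path_lengths(times, edges):
    """Longest zero-delay path length ending at each vertex of a HDFG.

    times: dict mapping vertex -> computation time.
    edges: list of (src, dst, delay) tuples (assumed layout).
    Returns a dict, or None if the zero-delay edges contain a cycle.
    """
    succ = {n: [] for n in times}
    indeg = {n: 0 for n in times}
    for u, v, d in edges:
        if d == 0:                      # only zero-delay edges matter
            succ[u].append(v)
            indeg[v] += 1

    # A vertex with no zero-delay predecessor starts at its own time.
    length = dict(times)
    ready = deque(n for n in times if indeg[n] == 0)
    visited = 0
    while ready:                        # sweep in topological order
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            length[v] = max(length[v], length[u] + times[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return length if visited == len(times) else None


def clock_period(times, edges):
    """Length of the longest zero-delay path, or None on a zero-delay cycle."""
    lengths = zero_delay_path_lengths(times, edges)
    return None if lengths is None else max(lengths.values())
```

Each zero-delay edge is examined once, so the sweep is linear in the size of the graph. Applied to an EHG, a None result corresponds to a zero-delay cycle, i.e. a graph that is not live.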
With this in hand we can now proceed to our primary method, given as Algorithm 3 below. We begin by retiming our SDFG with the result to date and constructing its EHG. The EHG is then handed to Algorithm 2 to find the lengths of the maximum paths to all vertices. At this point the vertices in our SDFG fall into one of two groups. If all copies of a vertex in the EHG are isolated (i.e., connected to the rest of the graph only by edges containing delays), we don't wish to retime the node and remove it from consideration. Otherwise the node lies along some zero-delay path and we may have to retime it. In this case we assign it a longest path length equal to the longest path length of any of its copies in the EHG.
At this point we consider nodes for retiming. Since we want to push delays forward along our paths (rather than pulling them backward as was done in [16]), we retime those nodes which occur early in a path. This [...]. Therefore, for each outgoing edge from such a node, we calculate the number of delays needed to retime all copies of the edge in the EHG, subtract the number of delays currently on the edge, adjust for the different rates of production and consumption, and retime by the maximum of these needs. Once all nodes are retimed, we test the prospective retiming for legality, i.e., we check that retiming by our function doesn't result in some edge containing a negative number of delays. If we pass this test, we look further along our path for other nodes in need of retiming.
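The legality test can be sketched as below. The sign convention is an assumption on our part, chosen to match the idea of pushing delays forward: a retiming value r(u) is taken to mean that each outgoing edge of u gains p·r(u) delays and each incoming edge of u loses c·r(u) delays, where p and c are the production and consumption rates on that edge. The function name and the (source, sink, produced, consumed, delays) edge layout are hypothetical.

```python
def retime_sdfg(edges, r):
    """Apply a retiming to an SDFG and test it for legality.

    edges: list of (src, dst, produced, consumed, delays) tuples (assumed layout).
    r:     dict mapping node -> integer retiming value (missing nodes count as 0).
    Returns the retimed edge list, or None if some edge would go negative.
    """
    retimed = []
    for u, v, p, c, d in edges:
        # Assumed convention: delays are pushed forward through each node.
        new_d = d + p * r.get(u, 0) - c * r.get(v, 0)
        if new_d < 0:
            return None                 # illegal: negative delay count
        retimed.append((u, v, p, c, new_d))
    return retimed
```

Because groups of delays move at once, such a check would be repeated after every tentative update of the retiming function rather than only at the end.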
Once we have checked all nodes at least once and have derived a legal retiming function, it is time to test our answer. We retime the SDFG, construct its EHG, and pass this to Algorithm 2 to find the maximum zero-delay paths to each node. Since the length of the largest zero-delay path in the EHG equals our clock period, this length is tested against our requested clock period. If the length is still too large we cannot retime this SDFG to execute in the time we wish and must return with an error. Otherwise we have found our retiming.
We now demonstrate our method by executing it on the SDFG of Figure 11 [...] time. However, while we suspect that its success is both a necessary and sufficient condition for a SDFG to be retimable to a given clock period, it is unknown whether or not this is the case. In our defense, the algorithm from [16] upon which this method is based was also never proven both necessary and sufficient, but has been extremely useful in practice. We suspect that the algorithm we've described here will prove just as valuable despite this logical gap.
A Simple Example
To illustrate our method further, consider the SDFG in Figure 15, a variation on the example from [17] [...] We still have a pass of the algorithm to perform. This time the nodes are separated from each other by delays, and so [...] times was [...], this is also the clock period of our retimed graph.
Example: A Simplified Spectrum Analyzer
Finally, let us apply our algorithm to a variation of the simplified spectrum analyzer from [24] which appears in Figure 17(a), with node descriptions in Figure 17(b). This graph has a BRV of [...], so in the interests of space we will not display the EHG at each step. Instead, we shall describe the pertinent information. It can be shown that the lower bound on the clock period for this SDFG is [...], and so we will attempt to retime it to be optimal. At first, we see the zero-delay paths [...] reveals a pattern to which we will refer again. If there are sufficient delays on an edge, the value of the retiming for the edge's source node will not change. Applying this retiming to the graph in Figure 17(a) yields the graph of Figure 18(a). If our algorithm were rewritten so that this graph were checked, we would find that just this initial step has resulted in an optimal retimed graph.
Regardless of how good our algorithm may be, it is still not proven to represent both a necessary and sufficient condition for retiming. This proof, or the construction of an alternate method which is necessary and sufficient, remain interesting open problems. Correcting the errors in [24] will definitely lead to greater understanding of our model and may open the door to removing this logical gap. It may also lead to a study of retiming applied to even more complicated models, such as cyclo-static or dynamic DFGs [1].
