Abstract: Scheduling and retiming are important techniques used in the design of hardware and software implementations of digital signal processing algorithms. In this paper, techniques are developed for generating all scheduling and retiming solutions for a strongly connected data-flow graph, allowing a designer to explore the space of possible implementations. Formulations are developed for two scheduling problems. The first scheduling problem assumes a bit-parallel target architecture. The formulation for this problem is general because it considers retiming the data-flow graph as part of scheduling, and this formulation reduces to the retiming formulation as a special case. The second scheduling problem assumes a bit-serial target architecture. Based on these formulations, the conditions for a legal scheduling solution are derived, and a systematic technique is presented for exhaustively generating all legal scheduling solutions for a strongly connected data-flow graph. Since retiming is a special case of scheduling, this systematic technique can also be used for exhaustively generating all legal retiming solutions. A technique is also developed for exhaustively generating only those bit-parallel schedules which satisfy a given set of resource constraints. The techniques for exhaustively generating scheduling and retiming solutions are demonstrated for several filters. For example, we show that a simple filter such as the biquad has 224 possible retiming solutions for a latency of one time unit. We also show that a fifth-order wave digital elliptic filter has 4.7 million and 580 million scheduling solutions for iteration periods of 17 and 18, respectively.
the importance of time scheduling and retiming cannot be overstated.
This paper presents new formulations of the time scheduling and retiming problems and, based on these formulations, develops new techniques for determining the solutions to these problems [2]. (From this point forward, we shall refer to time scheduling as simply scheduling.) These formulations are valid for strongly connected (SC) graphs, where a strongly connected graph has a path $u \leadsto v$ and a path $v \leadsto u$ for every pair of nodes $u, v$ in the graph. We focus on SC graphs because the feedback present in these graphs has traditionally made them the most challenging to map to physical realizations. An example of an SC data-flow graph (DFG) is the fifth-order wave digital elliptic filter [3] in Fig. 15, which is commonly used as a benchmark for demonstrating high-level synthesis techniques.
Scheduling consists of assigning execution times to the operations in a DFG such that the precedence constraints of the DFG are not violated. A great deal of literature exists on the topic of scheduling in the context of high-level synthesis for ASIC design for DSP applications [4]-[20]; however, none of these works gives a formal definition of scheduling along with systematic techniques for exhaustively generating the solutions to the scheduling problem. Integer linear programming (ILP) techniques use formal definitions of scheduling [10], [11], [17], but these techniques generate only one solution. This paper presents new scheduling formulations and algorithms for exhaustively generating all of the solutions to the scheduling problem. Generating all scheduling solutions is a theoretically interesting result which can also have practical applications. Two scheduling problems are considered in this paper, namely, scheduling for time-multiplexed execution on bit-parallel architectures and scheduling for execution on bit-serial architectures.
Retiming consists of moving delays around in a DFG without changing its functionality. As with scheduling, there is a large body of literature on retiming, and new applications for retiming are constantly being found. For example, due to the recent demand for low-power digital circuits in portable devices, some recent work has focused on retiming for power minimization [21]. The groundbreaking paper on retiming [1] describes algorithms for tasks such as retiming to minimize the clock period and retiming to minimize the number of registers (states) in the retimed circuit. An approach to retiming which is based on circuit theory can be used to generate all retiming solutions for a DFG [22]. This approach was the motivation for our work on exhaustive scheduling. In this paper, we show that retiming is a special case of scheduling, and consequently, the formulation of the scheduling problem and the techniques for exhaustively generating the scheduling solutions can also be applied to retiming.
The impact of the formulations derived in this paper is as follows.
• The interaction between retiming and scheduling is important [7] , and our formulations provide a simple way of observing this interaction.
• We show that retiming is a special case of scheduling.
• We derive mathematical descriptions of the scheduling and retiming problems in a common framework.
• We develop techniques for generating all solutions to a particular scheduling or retiming problem. This gives a developer the ability to search the design space for the best solution, particularly when various parameters are difficult to model and include in a cost function. This has applications to software design, ASIC design, and design for reconfigurable hardware implementations.
• Our formulations provide for a better understanding of scheduling and retiming which can be used to develop new heuristics for these problems.
Many of the results in this paper rely upon graph theory. Section II gives a review of some results from graph theory along with an algorithm for finding the linearly independent loops in an SC directed graph. Our formulations for scheduling to bit-parallel and bit-serial architectures are given in Section III along with an explanation of how retiming can be viewed as a special case of scheduling. Section IV contains the description of a systematic technique used to exhaustively generate the scheduling and retiming solutions. Section V describes two techniques for exhaustively generating the schedules which satisfy a given set of resource constraints for a bit-parallel architecture. Section V also includes the results of scheduling the fifth-order wave digital elliptic filter in Fig. 15 with and without resource constraints. Our conclusions are given in Section VI.
II. REVIEW OF GRAPH THEORY
This section provides a brief review of graph theory. Most of the definitions and results in this section can be found in [23] .
In this paper, we are concerned only with directed graphs. A directed graph is represented as $G = (V, E, w, t)$, where the following is true.
• $V$ is the set of vertices (nodes) of $G$. The vertices represent computations. The number of nodes in $G$ is $n$.
• $E$ is the set of directed edges of $G$. A directed edge $e$ from node $u$ to node $v$ is denoted as $u \xrightarrow{e} v$. The edges represent communication between the nodes. The number of edges in $G$ is $m$.
• $w(e)$ is the number of delays on the edge $e$, also referred to as the weight of the edge.
• $t(v)$ is the computation time of the node $v$.
A directed path from $u$ to $v$ is denoted as $u \leadsto v$. A simple path is a path with distinct edges, and an elementary path has distinct nodes. A cycle is a closed path (i.e., $u \leadsto u$). A simple cycle has distinct edges and an elementary cycle has distinct nodes. An elementary cycle in a directed graph will be referred to as a "loop" in this paper.
A directed graph is strongly connected if, for every pair of vertices $u, v \in V$, there exist paths $u \leadsto v$ and $v \leadsto u$. A directed spanning tree $T$ is a subgraph of $G$ which has a root node $v_r$ and a path $v_r \leadsto v$ for all $v \in V$ except $v = v_r$. The directed spanning tree contains no cycles. A directed spanning tree contains exactly $n$ nodes and $n - 1$ edges. An edge of a directed spanning tree is called a branch, and the edges of $G$ not included in the tree are called links. Every SC graph contains a directed spanning tree.
An edge $u \xrightarrow{e} v$ is incident with vertices $u$ and $v$. More specifically, $e$ is incident from $u$ and incident into $v$.
The set operations such as union, intersection, difference, complement, etc., are operations on the edges of a graph. Let $G_1$ and $G_2$ be two subgraphs of a connected graph $G$. $G_1 \cup G_2$ consists of all edges in $G_1$ or $G_2$ (or both) and the vertices incident with these edges. $G_1 - G_2$ is formed by removing all edges in $G_2$ from $G_1$, and then removing all vertices with no incident edges.
Graphs can be represented using matrices. In this paper, vectors are represented using bold lowercase (e.g., $\mathbf{x}$) and matrices are represented using bold uppercase (e.g., $\mathbf{X}$). The $i$th element of $\mathbf{x}$ is denoted as $x_i$ or $\mathbf{x}(i)$, and the $(i,j)$th element of $\mathbf{X}$ is denoted as $x_{ij}$. The $a \times b$ matrix full of zeros is denoted as $\mathbf{0}_{a \times b}$, or simply as $\mathbf{0}$ when its dimensions do not need to be explicitly stated, and similar notation is used for the matrix full of ones.
Let $\mathbf{A}$ be the oriented incidence matrix of $G$. This matrix, which has dimensions $n \times m$, is defined as
$$a_{ij} = \begin{cases} 1, & \text{if } e_j \text{ is incident from } v_i \\ -1, & \text{if } e_j \text{ is incident into } v_i \\ 0, & \text{if } e_j \text{ is not incident with } v_i \end{cases}$$
and $\operatorname{rank}(\mathbf{A}) = n - 1$. The reduced oriented incidence matrix $\tilde{\mathbf{A}}$ is defined to be any $n - 1$ rows of $\mathbf{A}$. $\tilde{\mathbf{A}}$ has dimensions $(n-1) \times m$ and $\operatorname{rank}(\tilde{\mathbf{A}}) = n - 1$. Let $\mathbf{B}$ be the fundamental loop matrix of the SC graph $G$. This matrix, which has dimensions $(m-n+1) \times m$, is defined as
$$b_{ij} = \begin{cases} 1, & \text{if edge } e_j \text{ is in loop } l_i \\ 0, & \text{otherwise.} \end{cases}$$
The rows of $\mathbf{B}$ are linearly independent, and the loops in $G$ which are represented by these rows are referred to as the linearly independent loops in $G$. The remaining loops in $G$, which are not represented by the rows of $\mathbf{B}$, are said to be the linearly dependent loops in $G$ because these loops can be represented as linear combinations of the rows of $\mathbf{B}$. An SC graph contains exactly $m - n + 1$ linearly independent loops.
Two important relationships between the fundamental loop matrix and the oriented incidence matrix are $\mathbf{B}\mathbf{A}^T = \mathbf{0}$ and $\mathbf{A}\mathbf{B}^T = \mathbf{0}$. Each of the rows of the fundamental loop matrix corresponds to a linearly independent loop. Algorithm FFL finds such a set of loops. The loops that it produces, denoted as $l_k$ for $1 \le k \le m - n + 1$, form a basis for the loops in the strongly connected graph $G$.
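As a concrete check of the relationship between the two matrices, the following minimal sketch builds an oriented incidence matrix and a loop matrix for a small graph and verifies that their product vanishes. The four-node graph, its edge numbering, the sign convention (+1 for "incident from", -1 for "incident into"), and all function names are assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical strongly connected graph with 4 nodes and 5 edges.
# Each edge is (source, target); edge j is column j of both matrices.
EDGES = [(0, 1), (1, 2), (2, 3), (2, 0), (3, 1)]
NUM_NODES = 4


def incidence_matrix(num_nodes, edges):
    """Oriented incidence matrix: +1 where an edge leaves a node, -1 where it enters."""
    a = [[0] * len(edges) for _ in range(num_nodes)]
    for j, (u, v) in enumerate(edges):
        a[u][j] = 1
        a[v][j] = -1
    return a


def loop_matrix(loops, num_edges):
    """One 0/1 row per loop; each loop is given as a list of edge indices."""
    return [[1 if j in loop else 0 for j in range(num_edges)] for loop in loops]


def mat_mul_transpose(x, y):
    """Return x * y^T for two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(rx, ry)) for ry in y] for rx in x]


A = incidence_matrix(NUM_NODES, EDGES)
# The two loops of this graph: 0->1->2->0 and 1->2->3->1.
B = loop_matrix([[0, 1, 3], [1, 2, 4]], len(EDGES))
BAt = mat_mul_transpose(B, A)
print(BAt)  # every entry is zero
```

Each loop enters and leaves every node it visits exactly once, so the +1 and -1 contributions cancel row by row, which is why the product is the zero matrix.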
Algorithm FFL maintains a subgraph $G'$ which initially consists of the root node of the directed spanning tree $T$. During iteration $k$, a link $e_k$ in $G$ which is incident into a node in $G'$ is chosen in STEP 1. In STEP 2, a loop denoted as $l_k$ is found in the subgraph consisting of the link $e_k$ and the edges in $G' \cup T$. $G'$ is then updated at the end of the iteration. The most computationally demanding step in this algorithm is finding $l_k$ in STEP 2, which requires a polynomial number of operations [24]. Therefore, Algorithm FFL can be computed in polynomial time.
We construct the fundamental loop matrix $\mathbf{B}$ by letting $l_k$ from Algorithm FFL be the $k$th row of $\mathbf{B}$. The edges in the graph are numbered such that the first $n - 1$ columns of $\mathbf{B}$ correspond to the branches of the spanning tree $T$ of $G$, and the remaining $m - n + 1$ columns correspond to the links. The link $e_k$ is assigned to the $(n - 1 + k)$th column of $\mathbf{B}$. By constructing the fundamental loop matrix in this manner, it has the form
$$\mathbf{B} = [\, \mathbf{B}_b \;\; \mathbf{B}_l \,] \tag{1}$$
where $\mathbf{B}_b$ is an $(m-n+1) \times (n-1)$ matrix and $\mathbf{B}_l$ is an $(m-n+1) \times (m-n+1)$ lower triangular matrix with ones on the diagonal. Note that the columns of $\mathbf{B}_l$ correspond to the links of $G$ while the columns of $\mathbf{B}_b$ correspond to the branches of $T$. Because of its form, $\mathbf{B}$ has rank $m - n + 1$. Adding more loops of $G$ to $\mathbf{B}$ (adding a loop would consist of adding a row to $\mathbf{B}$) does not increase its rank. Therefore, the rows of $\mathbf{B}$ form a basis for the loops of $G$. An alternative method of constructing $\mathbf{B}$ is to find all of the loops in $G$, using an algorithm such as the one given in [24], and then choosing $m - n + 1$ linearly independent loops from these as the rows of $\mathbf{B}$.
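The two steps described for Algorithm FFL can be sketched as follows. This is one possible reading of the description, not the paper's listing: it assumes that the loop in STEP 2 is found by a breadth-first search over the branches and the edges already in the maintained subgraph, and the example graph, the function name, and the data layout are hypothetical.

```python
from collections import deque


def find_fundamental_loops(edges, branches, root=0):
    """Return one directed loop (a list of edge indices) per link.

    edges    : list of (source, target) pairs; edge j is edges[j]
    branches : set of edge indices forming a directed spanning tree rooted at root
    The graph is assumed to be strongly connected.
    """
    links = [j for j in range(len(edges)) if j not in branches]
    sub_nodes, sub_edges = {root}, set()  # the growing subgraph
    loops = []
    while links:
        # STEP 1: pick an unused link that is incident into a node of the subgraph.
        link = next(j for j in links if edges[j][1] in sub_nodes)
        links.remove(link)
        src, dst = edges[link]
        # STEP 2: breadth-first search from dst to src over branches and subgraph edges.
        allowed = sorted(set(branches) | sub_edges)
        reached_by = {dst: None}
        queue = deque([dst])
        while queue:
            node = queue.popleft()
            for j in allowed:
                u, v = edges[j]
                if u == node and v not in reached_by:
                    reached_by[v] = j
                    queue.append(v)
        # Walk back from src to dst to collect the path, then close it with the link.
        path, node = [], src
        while reached_by[node] is not None:
            path.append(reached_by[node])
            node = edges[reached_by[node]][0]
        loop = sorted(path) + [link]
        loops.append(loop)
        # Update the subgraph with the edges and nodes of the new loop.
        sub_edges.update(loop)
        for j in loop:
            sub_nodes.update(edges[j])
    return loops


# Hypothetical graph: edges 0-2 are branches (tree rooted at node 0), edges 3-4 are links.
EDGES = [(0, 1), (1, 2), (2, 3), (2, 0), (3, 1)]
loops = find_fundamental_loops(EDGES, branches={0, 1, 2})
print(loops)
```

Because the search only uses branches and links chosen in earlier iterations, each loop contains its own link and no later link, which gives the link columns the lower triangular shape with ones on the diagonal.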
Example 2.1: This example uses Algorithm FFL to form the fundamental loop matrix for the graph in Fig. 1. The spanning tree with node 1 as the root node is shown in Fig. 2(a). At the start of Algorithm FFL, the maintained subgraph consists of node 1 alone. During the first iteration, the only possibility for the link is edge 6, and the only possible loop is the one circled in Fig. 2(b). During the second iteration, there are two possibilities for the link, namely, edges 7 and 8. Choosing edge 7 results in the loop circled in Fig. 2(c). During the third iteration, the two possibilities for the link are edges 8 and 9. Choosing edge 8 results in the loop circled in Fig. 2(d). During the fourth iteration, the link is edge 9, which yields the fourth loop. The resulting fundamental loop matrix has the desired form given in (1): each row corresponds to the loop found in the matching iteration of Algorithm FFL, and each column corresponds to the edge of the graph with the same number.
III. SCHEDULING AND RETIMING FORMULATIONS
Time scheduling (or simply scheduling) consists of assigning execution times to the operations in a DFG such that the precedence constraints of the DFG are not violated. This section considers two scheduling problems, namely, scheduling to a time-multiplexed bit-parallel target architecture (we call this bit-parallel scheduling) and scheduling to a bit-serial target architecture (we call this bit-serial scheduling). It turns out that the bit-parallel and bit-serial scheduling formulations are quite similar, and the retiming formulation is a special case of bit-parallel scheduling.
A. Bit-Parallel Scheduling
In bit-parallel scheduling, a DFG is statically scheduled to a bit-parallel target architecture. The scheduling formulation presented in this section is based on the folding equation developed in [25] . Folding is the process of executing several algorithm operations on a single hardware module. Scheduling is the process of determining at which time units a given algorithm operation is to be executed in hardware.
Before the scheduling formulation is developed, we need a brief description of retiming. The basic retiming equation for an edge $e$ from node $u$ to node $v$ is [1]
$$w_r(e) = w(e) + r(v) - r(u) \tag{2}$$
where $w(e)$ is the number of delays on the edge before retiming, $w_r(e)$ is the number of delays on the edge after retiming, and $r(u)$ and $r(v)$ are the retiming values of nodes $u$ and $v$, respectively. The notions of an iteration and an iteration period are used in this section. An iteration is defined as the execution of each node in the DFG exactly once. The iteration period $N$ is defined as the number of clock cycles used to execute one iteration of the DFG in hardware.
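To illustrate the basic retiming equation, the short sketch below retimes a small graph and confirms that the number of delays on each loop is unchanged. It uses the standard form of the retiming equation from [1], retimed weight = original weight + retiming value of the target node - retiming value of the source node; the graph, the weights, and the retiming values are hypothetical.

```python
def retime(edges, w, r):
    """Apply w_r(e) = w(e) + r(v) - r(u) to every edge e = (u, v)."""
    return [w[j] + r[v] - r[u] for j, (u, v) in enumerate(edges)]


# Hypothetical graph with loops 0->1->2->0 (edges 0, 1, 3) and 1->2->3->1 (edges 1, 2, 4).
EDGES = [(0, 1), (1, 2), (2, 3), (2, 0), (3, 1)]
LOOPS = [[0, 1, 3], [1, 2, 4]]
w = [0, 0, 0, 2, 1]   # delays before retiming
r = [0, 1, 1, 2]      # hypothetical retiming values, one per node

w_r = retime(EDGES, w, r)
delays_before = [sum(w[j] for j in loop) for loop in LOOPS]
delays_after = [sum(w_r[j] for j in loop) for loop in LOOPS]
print(w_r, delays_before, delays_after)
```

Around any loop the retiming values telescope to zero, so the per-loop delay counts before and after retiming agree.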
Consider an edge $e$ from node $u$ to node $v$, denoted as $u \xrightarrow{e} v$. The operations (nodes) in the DFG are scheduled to be executed in the folded architecture once every $N$ clock cycles, where $N$ is the iteration period. Let the $l$th iteration of nodes $u$ and $v$ be executed in hardware at time units $Nl + \tau_u$ and $Nl + \tau_v$, respectively, where $\tau_u$ and $\tau_v$ are the time partitions to which the nodes are scheduled to execute such that $0 \le \tau_u, \tau_v \le N - 1$. Let edge $e$ have $w_r(e)$ delays, which means that the result of the $l$th iteration of node $u$ is used by the $(l + w_r(e))$th iteration of node $v$. The hardware modules which execute nodes $u$ and $v$ are denoted as $H_u$ and $H_v$, respectively. If $H_u$ is pipelined by $P_u$ stages, then the result of the $l$th iteration of node $u$ is available at $Nl + \tau_u + P_u$. This sample is used by the $(l + w_r(e))$th iteration of node $v$, which is executed by $H_v$ at $N(l + w_r(e)) + \tau_v$, so the sample must be stored for $d(e) = N w_r(e) - P_u + \tau_v - \tau_u$ clock cycles. Substituting for $w_r(e)$ using (2) gives
$$d(e) = N\bigl(w(e) + r(v) - r(u)\bigr) - P_u + \tau_v - \tau_u. \tag{3}$$
The edge with $w(e)$ delays in the DFG maps to an edge from $H_u$ to $H_v$ with $d(e)$ delays in the architecture, and the data on this edge are switched into $H_v$ at time units $Nl + \tau_v$. Note that we assume that the hardware module $H_u$ is pipelined by $P_u = t(u)$ delays, where $t(u)$ is the computation time of the node $u$ in the DFG. If we define an $m \times 1$ vector $\mathbf{p}$ whose $j$th element is the computation time of the source node of edge $e_j$ (the source node of an edge is the node that the edge is incident from), then the folding equation can be written for all edges of the DFG simultaneously using
$$\mathbf{d} = N\mathbf{w} - \mathbf{p} - \mathbf{A}^T(\boldsymbol{\tau} + N\mathbf{r}) \tag{4}$$
where $\mathbf{A}$ is the incidence matrix for the graph $G$ (see Section II; an entry is $1$ where an edge is incident from a node and $-1$ where it is incident into a node), $\boldsymbol{\tau}$ is the time partition vector which assigns node $v_i$ to the time partition $\tau_i$, $\mathbf{r}$ is the retiming vector with the retiming values of the nodes in $G$, $\mathbf{w}$ is $m \times 1$ and contains the number of delays on each edge of $G$, $\mathbf{d}$ is the folding vector which contains the number of delays on each edge of the folded architecture, and $\mathbf{p}$ is the delay vector as previously described. This formulation of folding is general because it relies upon the retiming solution $\mathbf{r}$ and the time partition vector $\boldsymbol{\tau}$. One way to view this is that the DFG is preprocessed using retiming (hence the $\mathbf{r}$ vector) and then scheduling is performed on the retimed DFG (hence the $\boldsymbol{\tau}$ vector). Combining $\boldsymbol{\tau}$ and $\mathbf{r}$ using $\mathbf{s} = \boldsymbol{\tau} + N\mathbf{r}$ results in the schedule vector $\mathbf{s}$. Using $\mathbf{s}$, the scheduling problem can be written as
$$\mathbf{A}^T\mathbf{s} = N\mathbf{w} - \mathbf{p} - \mathbf{d}. \tag{5}$$
The rank of the incidence matrix is $n - 1$. Therefore, the left nullspace of $\mathbf{A}$ must consist of a vector $\mathbf{x}$ which satisfies $\mathbf{x}^T\mathbf{A} = \mathbf{0}$. We can see that $\mathbf{x} = \mathbf{1}$ because each column of $\mathbf{A}$ contains exactly one entry which is a $1$, one entry which is a $-1$, and the remaining entries of the column are zero.
Using the relationship , we can write which means that adding the constant to each element of the schedule vector does not change the number of delays on the edges of the folded architecture. The incidence matrix can be written as
The reduced incidence matrix consists of any rows of . Removing row of results in (6) The reduced incidence matrix has dimensions and rank . The reduced scheduling vector is defined as (7) which can be written as , where and are the time partition vector and the retiming vector with the th elements removed. Using and , we can write Substituting this into (5) results in (8) Node is called the reference node. Since replacing by does not alter the resulting folded architecture, we can choose so . After replacing with , (8) becomes . Throughout the remainder of this paper, we will assume that so . In an abuse of notation, we will refer to simply as so that (5) can be written as (9) Lemma 3.1: Equation (9) can be solved for if and only if .
Proof: Equation (9) is in loop and otherwise. Therefore, is the total number of folded delays on loop , and is a constant that depends on . The equation states that the number of folded delays on loop is the same for any legal folding vector , and implies that this is true for all linearly independent loops of represented by the rows of . Furthermore, the sum of the number of folded delays for all edges and pipelining delays associated with all nodes of a loop is the product of the folding factor and the number of loop delay elements, as noted in [25] . It can also be shown that this holds for the dependent loops of , i.e., the number of folded delays on each loop of that is not represented by a row of is the same for any legal folding vector .
If holds, (9) has exactly one solution for , which is given by (10) The above discussion can be summarized by saying that the number of folded delays on each loop in is the same for any valid schedule .
In addition to the condition there is also the practical condition that the number of delays on an edge in the folded architecture must be nonnegative. This condition can be written as . The constraints for a valid schedule are: 1) ; 2) .
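The validity conditions above can be checked edge by edge. The sketch below assumes the per-edge folding equation of [25] written in terms of the schedule values (folded delays = N times the edge weight, minus the pipelining depth of the source node, plus the schedule value of the target node, minus the schedule value of the source node) and tests that no folded delay is negative. The graph, weights, pipelining depths, iteration period, and function names are all hypothetical.

```python
def folded_delays(edges, w, p, s, N):
    """Folded delay of each edge e = (u, v): N*w(e) - p[u] + s[v] - s[u].

    p is indexed by node here (pipelining depth of each node's hardware module),
    and s is the schedule vector with one entry per node.
    """
    return [N * w[j] - p[u] + s[v] - s[u] for j, (u, v) in enumerate(edges)]


def is_valid_schedule(edges, w, p, s, N):
    """A schedule is accepted only if no edge needs a negative number of delays."""
    return all(d >= 0 for d in folded_delays(edges, w, p, s, N))


EDGES = [(0, 1), (1, 2), (2, 3), (2, 0), (3, 1)]
LOOPS = [[0, 1, 3], [1, 2, 4]]
w = [0, 0, 0, 2, 1]
p = [1, 1, 1, 1]        # every node takes one time unit (assumption)
N = 3                   # hypothetical iteration period
s = [0, 1, 2, 3]        # hypothetical schedule, node 0 is the reference node

d = folded_delays(EDGES, w, p, s, N)
shifted = folded_delays(EDGES, w, p, [x + 5 for x in s], N)
loop_totals = [sum(d[j] for j in loop) for loop in LOOPS]
print(d, loop_totals, is_valid_schedule(EDGES, w, p, s, N))
```

Adding the same constant to every schedule value leaves the folded delays unchanged, and the total on each loop depends only on N, the loop weights, and the pipelining depths, which matches the loop condition stated in the text.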
B. Retiming
Retiming is the process of moving delays around in a circuit without changing the functionality of the circuit [1] . A brief description of retiming is given in Section III-A. This section describes how retiming can be viewed as a special case of bit-parallel scheduling.
The folding equation for a graph is given in (4) . If each node in represents a hardware operator, then all operations in the graph are executed in a single clock cycle resulting in an iteration period of . The elements of the time partition vector are all zero because time partition zero is the only available partition. If we let , i.e., we do not consider any internal pipelining of the operators, (4) becomes which simplifies to (11) Since is the number of delays in the folded architecture, is equivalent to for , so (11) becomes (12) which is simply the matrix notation for writing (2) simultaneously for all edges of the graph. This demonstrates that retiming is simply scheduling when the iteration period is unity. Using , (12) can be written as If is a retiming vector which maps the graph to the retimed graph , then so is for any integer . In the context of retiming (i.e., assuming , and ), (9) can be written as (13) Recall that (9) assumes that . Since and is assumed to obtain (13), this implies that in (13) . In other words, the retiming value of the reference node is 0 in this formulation.
The translation of Lemma 3.1 to the retiming context is that (13) has a solution if and only if holds. This implies that the number of delays on any loop in remains unchanged during retiming, as noted in [1] . If holds, (13) has exactly one solution for , which is given by (14) In addition to the condition , there is also the practical condition that the number of delays on an edge in the retimed graph must be nonnegative. This condition can be written as . The conditions for a valid retiming from to are: 1) ; 2) .
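As an alternative to evaluating a closed-form matrix expression, a retiming vector can be recovered from the original and retimed edge weights by propagating values outward from the reference node. The sketch below is an illustration under that assumption rather than the paper's method: it fixes the retiming value of the reference node to 0, uses the standard retiming equation of [1], and returns None when the two conditions for a valid retiming are not met. The graph and the function name are hypothetical.

```python
def recover_retiming(num_nodes, edges, w, w_r, reference=0):
    """Return r with r[reference] = 0 such that w_r(e) = w(e) + r(v) - r(u), or None."""
    if any(x < 0 for x in w_r):
        return None  # retimed edge weights must be nonnegative
    r = [None] * num_nodes
    r[reference] = 0
    changed = True
    while changed:
        changed = False
        for j, (u, v) in enumerate(edges):
            shift = w_r[j] - w[j]  # must equal r[v] - r[u]
            if r[u] is not None and r[v] is None:
                r[v] = r[u] + shift
                changed = True
            elif r[v] is not None and r[u] is None:
                r[u] = r[v] - shift
                changed = True
    # Every edge, including those not used for propagation, must be consistent.
    for j, (u, v) in enumerate(edges):
        if r[u] is None or r[v] is None or w[j] + r[v] - r[u] != w_r[j]:
            return None
    return r


EDGES = [(0, 1), (1, 2), (2, 3), (2, 0), (3, 1)]
w = [0, 0, 0, 2, 1]
print(recover_retiming(4, EDGES, w, [1, 0, 1, 1, 0]))
print(recover_retiming(4, EDGES, w, [1, 0, 1, 2, 0]))  # loop delay count changed
```

The second call fails because the proposed weights put three delays on a loop that originally had two, which violates the loop condition.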
C. Bit-Serial Scheduling
In this section, a scheduling formulation is developed where the target architecture is a bit-serial architecture. This formulation, which is similar to the formulation in [26, Ch. 6] , has the same general form as the retiming and the bit-parallel scheduling formulations in Sections III-A and III-B.
A bit-serial operator is often represented using a timing diagram such as the one in Fig. 3 . Let the execution of operator in this figure begin at time . The first bit of each of the inputs , and arrives at time units , , and , respectively. The first bit of each of the outputs and is produced at time units and , respectively. In other words, the timing diagram gives the relative differences between the timing of the input and output samples of the operator. The timing diagram for this architecture. Example 3.1 For the bit-serial adder in Fig. 4(a) , which computes , the timing diagram is shown in Fig. 4(b) . Note that is the wordlength. The constraints for the bit-serial scheduling problem can be derived using the timing diagram. Consider the edge with delays in Fig. 5 . The output of iteration of is used as the input of iteration of . Let the th iteration of nodes and begin execution at time units and , respectively, where is the data wordlength and and are the time partitions to which the nodes are scheduled to execute such that . The output of the th iteration of is available at and the output of the th iteration of is consumed at , so the result must be stored for clock cycles.
This equation can be written for all edges of the graph simultaneously according to (15) 
where and are defined as in (6) and (7), and the scheduling value for the reference node is . Using the same argument as in Lemma 3.1, it can be shown that the bit-serial scheduling (16) has a solution if and only if . The equation states that the sum of the serial delays in any loop of the hardware implementation is the same for any valid serial delay vector . In addition, the sum of the number of serial delay elements of all edges and latencies associated with all nodes in a loop is the same as the product of the word length and the number of loop delay elements.
A second constraint, , exists because a connection in hardware cannot have a negative number of delays. The constraints for a valid bit-serial schedule are: 1) ; 2) . The value of the schedule vector can be found using (17).
IV. GENERATING ALL SCHEDULING AND RETIMING SOLUTIONS
A. Generating All Bit-Parallel Scheduling Solutions
Based on the two constraints and , all scheduling solutions for a strongly connected DFG can be generated. A systematic technique for generating these solutions is presented in this section.
Recall that is the fundamental loop matrix which can be expressed as , where is an matrix and is an lower triangular matrix with ones on the diagonal. The columns of correspond to the branches of the spanning tree of which is chosen before Algorithm FFL is used to find , and the columns of correspond to the links of . The rows of correspond to linearly independent loops in .
The algorithm for generating all scheduling solutions requires an interval to be written for the folded weight of each branch of and an equality to be written for the folded weight of each link of . The interval for the folded weight of a branch gives the range of possible values for the number of folded delays for this branch in the folded architecture. The equality for the folded weight of a link gives an expression for the number of delays for the link in the folded architecture. Using these intervals and equalities, code can be constructed to generate all possible scheduling solutions.
To determine these intervals and equalities, the elements of the fundamental loop matrix are examined one-by-one in a row-by-row manner, starting at the top-left of the matrix. Each time a "1" is encountered in the submatrix of such that this "1" is the first "1" encountered in its column, an interval is specified for this branch. This interval, which represents the range for the number of folded delays for the branch in the folded architecture, takes into account the intervals and equalities previously determined in the row-by-row scan of .
Assume that the first "1" in column of is in row , i.e., and for all . Let denote any row of such that , i.e., is a fundamental loop that contains the edge . Since is the first "1" in column must hold, i.e., is in row or in a row which is below row . From , we get (18) Let denote the set of edges encountered before reaching the element in the row-by-row scan of . Mathematically, is the set of edges such that there exists an element such that . Using , we can rewrite (18) as (19) The intervals and equalities for the edges in the set have not yet been determined; however, we do know from that . Using this in (19) results in Using this along with specifies the interval for (20) which must hold for all such that . Because the matrix in is lower triangular with ones on the diagonal, the diagonal element of row , is always the first "1" encountered in column of during the row-by-row scan of . In addition to using to denote this element, it can also be denoted as where . When is encountered in the row-by-row scan of such that , an equality is written for based on the equation . This equality, which uses the fact that the intervals and equalities have already been determined for all edges in except edge , is (21) To summarize the above discussion, the matrix is scanned in a row-by-row manner starting with . When is encountered, if is the first "1" in its column of , the interval in (20) is written for all such that . When is encountered where , the equality in (21) is written.
Each branch of the spanning tree is assigned one interval, and each link is assigned one equality. An algorithm written in pseudocode for determining these intervals and equalities is given in Algorithm IE. At any point in this algorithm, a set is maintained that holds the edges whose intervals or equalities have previously been determined. Comments for the algorithm are written using the convention of the C programming language. From the intervals and equalities, code can be written to enumerate all possible scheduling solutions. The general structure of the code is the following. 1) Write FOR loops for the intervals and write assignment statements for the equalities in the same order that these intervals and equalities are generated in Algorithm IE. The code for finding all scheduling solutions is shown at the bottom of the page.
There are 12 scheduling solutions for this DFG. The scheduling vector can be computed from the folded edge vector using (10) . Using node 1 as the reference node, the folded edge weights and the scheduling values for the nodes are listed in Table I .
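The nested FOR loops and assignment statements described in this section can be mimicked by a small recursive enumerator. The sketch below is not Algorithm IE: it assumes a loop matrix whose branch columns come first and whose link columns form a lower triangular block with ones on the diagonal, bounds each branch by the smallest remaining budget of the loops that contain it, and then fixes each link from its loop equality. The matrix, the right-hand sides, and the function name are hypothetical.

```python
def enumerate_loop_solutions(B, c, num_branches):
    """Yield every nonnegative integer vector d that satisfies B d = c.

    Columns 0 .. num_branches-1 of B are branches; the remaining columns are links,
    ordered so that the link block is lower triangular with ones on the diagonal.
    """
    num_loops, num_edges = len(B), len(B[0])

    def slack(i, d):
        # Budget of loop i that is not yet used by the values in d.
        return c[i] - sum(B[i][j] * d[j] for j in range(num_edges))

    def assign(j, d):
        if j == num_branches:
            full = d[:]
            for i in range(num_loops):      # equalities for the links, in order
                value = slack(i, full)
                if value < 0:
                    return
                full[num_branches + i] = value
            yield full
            return
        rows = [i for i in range(num_loops) if B[i][j]]
        if not rows:
            raise ValueError("branch %d lies on no loop" % j)
        upper = min(slack(i, d) for i in rows)
        for value in range(upper + 1):      # interval for branch j
            d[j] = value
            yield from assign(j + 1, d)
        d[j] = 0

    yield from assign(0, [0] * num_edges)


# Hypothetical loop matrix: three branches followed by two links.
B = [[1, 1, 0, 1, 0],
     [0, 1, 1, 0, 1]]
retimings = list(enumerate_loop_solutions(B, [2, 1], 3))   # loop delay counts 2 and 1
foldings = list(enumerate_loop_solutions(B, [3, 0], 3))    # loop budgets for N = 3, unit delays
print(len(retimings), len(foldings))
```

For each vector produced this way, one schedule (or retiming) vector can then be recovered with the reference node fixed at zero.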
These four steps give the valid schedules for . The retiming vector corresponds to a valid retiming solution for , and the elements of the partition vector satisfy . For each legal folding vector , the technique in this section finds exactly one schedule , which contains information about the time partitions and the retiming values of the nodes. However, there are actually schedules which map the DFG to a folded architecture which has delays on its edges. We call these solutions equivalent schedules, and we call the solution found using Step 2 above the fundamental schedule of the folding vector . The equivalent schedules are for . Replacing by has two effects. First, the switching instance in the folded architecture becomes . Second, if scheduling is viewed as preprocessing the DFG by retiming (finding ) and then assigning time partitions (finding ), the preprocessed DFG may change because may change. A nice property of the technique presented in this section is that it finds the fundamental schedule for each folding vector , and the equivalent schedules are implicitly known to be for .
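The relationship between a schedule vector, its time partitions, and its retiming values can be made concrete with integer division. The sketch below assumes that a schedule value equals N times the retiming value plus the time partition, and that the equivalent schedules are obtained by adding the constants 0 through N - 1 to every element, as the surrounding text suggests; the function names and the numbers are hypothetical.

```python
def split_schedule(s, N):
    """Split each schedule value into (time partition, retiming value) with s = N*r + tau."""
    tau = [x % N for x in s]      # Python's % always returns a value in 0 .. N-1
    r = [x // N for x in s]       # floor division, so negative values work too
    return tau, r


def equivalent_schedules(s, N):
    """Schedules obtained by adding k = 0 .. N-1 to every element of s."""
    return [[x + k for x in s] for k in range(N)]


N = 3
fundamental = [0, 1, 2, 3]        # hypothetical fundamental schedule
for schedule in equivalent_schedules(fundamental, N):
    print(schedule, split_schedule(schedule, N))
```

The printed partitions and retiming values change from one equivalent schedule to the next even though the folded edge delays do not, which is the effect described above.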
B. Generating All Retiming Solutions
Since retiming is a special case of scheduling, the techniques in Section IV-A for generating all scheduling solutions can also be used to generate all retiming solutions by replacing with and letting and .
Example 4.2:
In this example, we generate the edge intervals and equalities for the graph in Fig. 6 . The fundamental loop matrix for this graph is the weight vector is and . The intervals and equalities are generated in the following order using Algorithm IE:
Using these intervals and equalities, the code which generates all retiming solutions for the DFG in Fig. 6 is shown at the bottom of the page. Note that x is used to represent . There are twelve retiming solutions for the DFG. The retiming vector is computed from the retimed weight vector using (14) and , where node 1 is the reference node. The retimed edge weights and the retiming values for the nodes are listed in Table II .
If a DFG is not strongly connected, it is possible to add edges to the DFG to make it strongly connected so all retiming solutions can be generated. Consider the biquad filter in Fig. 7(a) . This graph is not strongly connected because, for example, there is no path from the output node to the input node. To make this graph strongly connected, it can be modified by adding an edge from the output node to the input node as shown in Fig. 7(b) . The modified graph has a new loop which has one delay. This loop forces the latency of the DFG to be one cycle. Using the techniques presented in this section, we find that there are 224 retiming solutions for the DFG in Fig. 7(b) .
As another example, consider the correlator in Fig. 8 which is used to demonstrate retiming in [1] . Using the techniques presented in this section, 143 retiming solutions can be found for this DFG. This result was also reported in [22] . 
C. Bit-Serial Scheduling
Since the bit-serial scheduling formulation has the same form as the bit-parallel scheduling formulation, the techniques used to generate all bit-parallel scheduling solutions can be used to generate all bit-serial scheduling solutions by replacing with and replacing with . The values of and can be computed from (recall that ) using and . It can be shown that these expressions for and result in the following.
• . This means that is indeed a time partition satisfying .
• and if for all edges , as shown in Fig. 5. This means that is a valid retiming solution of when for all .
Example 4.3: In this example, we generate all possible schedules for the bit-serial implementation of the third-order all-pole filter shown in Fig. 9, assuming two's complement number representation, a data wordlength of 8, and a coefficient wordlength of 4.
The first step is to determine the timing diagram for each operator. The circuit and timing diagram for an adder are given in Fig. 4 . The circuits and timing diagrams for multiplication by , and are given in Fig. 10(a)-(c) , respectively. Using these subcircuits, the timing diagram for the filter is shown in Fig. 11 . The fundamental loop matrix is
In addition, we have , and . The equation The complete architecture for this solution is shown in Fig. 12 . This architecture uses 20 registers, not including the registers which are internal to the processing units.
V. BIT-PARALLEL SCHEDULING WITH RESOURCE CONSTRAINTS
When all of the schedules are generated for a DFG, this may include many schedules which require more hardware resources than are available for the implementation. In this section, we describe two methods for finding the schedules which satisfy a given set of resource constraints. In the first method (the solution-save method), we generate all scheduling solutions and then save only the solutions which satisfy the resource constraints. In the second method (the solution-generate method), we only generate those scheduling solutions which satisfy the resource constraints.
A. The Solution-Save Method
The number of hardware modules required by a scheduled DFG can be determined from . For example, let be the number of multiplication operations scheduled to time partition , and let be the number of addition operations scheduled to time partition . Then the number of multipliers required by the schedule is and the number of adders is .

Example 5.1: In this example we find all scheduling solutions which require one multiplier and one adder for the biquad filter in Fig. 7(b) assuming an iteration period of and assuming that addition and multiplication require 1 and 2 units of time, respectively. Nodes 1, 2, 7, and 8 are addition operations and nodes 3, 4, 5, and 6 are multiplication operations.
The fundamental loop matrix is and . The intervals and equalities are
There is a total of 625 valid scheduling solutions for this example; however, only six of these solutions use only one adder and one multiplier. Table III gives the folding vectors and schedule vectors for these solutions where node 1 has been arbitrarily chosen as the reference node, which forces in all six solutions. Adding 3 to each element of the schedule vector in the fourth solution in Table III results in . This new schedule vector corresponds to the folding vector for the fourth solution in Table III (recall from Section III-A that adding a constant value to each element of the schedule vector does not change the number of delays on the edges in the folded architecture). This schedule vector is used to fold the biquad filter in [25, Example 11] , and the resulting folded architecture is given in [25, Fig. 12(d) ].
Example 5.2: Consider the four-stage pipelined eighth-order all-pole lattice filter in Fig. 13. Edge 11 has been added to this filter to make it strongly connected. For the iteration period 2, this filter has 450 scheduling solutions, and 99 of these schedules use two adders and two multipliers, where addition and multiplication are assumed to require one and two units of time, respectively. Of these 99 schedules, the minimum possible number of registers required for the implementation is ten, and only two of these 99 schedules use ten registers. These schedules are and . The minimum number of registers is computed using the techniques in [27] with one modification: the results reported here assume that, for a processor that is pipelined by stages, the pipelining registers cannot be used by output samples from other processors, while the results in [27] allow one pipelining register to be shared by other processors. For the iteration period , the filter in Fig. 13 has 910 910 scheduling solutions, and 10 083 of these schedules use one adder and one multiplier. Of these 10 083 schedules, the minimum possible number of registers required for the implementation is 11, and 21 of these 10 083 solutions use 11 registers.
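The solution-save method can be illustrated with a short sketch. The Python code below is not the code used to produce the results in this paper; it is a minimal illustration under the assumptions that every schedule passed in is already legal, that the functional units are pipelined so that an operation occupies a unit only in the time partition given by its schedule value modulo the iteration period, and that the number of units of each type is the maximum number of operations of that type assigned to any one time partition. The names `units_required` and `solution_save`, and the four-node example data, are hypothetical.

```python
from collections import Counter


def units_required(schedule, op_type, period):
    """Return {operation type: number of units} needed by one schedule.

    schedule: dict mapping node -> integer schedule value
    op_type:  dict mapping node -> operation type, e.g. "add" or "mul"
    period:   iteration period (number of time partitions)
    Assumption: a node occupies one unit of its type in partition
    (schedule value mod period) only.
    """
    usage = Counter()
    for node, value in schedule.items():
        usage[(op_type[node], value % period)] += 1
    required = Counter()
    for (kind, _partition), count in usage.items():
        required[kind] = max(required[kind], count)
    return dict(required)


def solution_save(schedules, op_type, period, available):
    """Keep only the schedules that fit within the available units."""
    kept = []
    for schedule in schedules:
        need = units_required(schedule, op_type, period)
        if all(count <= available.get(kind, 0) for kind, count in need.items()):
            kept.append(schedule)
    return kept


# Hypothetical data: two adders' worth of work and two multiplications.
toy_types = {"a1": "add", "a2": "add", "m1": "mul", "m2": "mul"}
toy_a = {"a1": 0, "a2": 1, "m1": 0, "m2": 1}  # one add and one mul per partition
toy_b = {"a1": 0, "a2": 2, "m1": 0, "m2": 1}  # both adds fall in partition 0
toy_kept = solution_save([toy_a, toy_b], toy_types, 2, {"add": 1, "mul": 1})
```

In this toy case only the first schedule survives the filter, since the second places both additions in the same time partition. The cost of the approach is that every schedule must be generated before any can be discarded.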
B. The Solution-Generate Method
This section describes a technique for exhaustively generating only the bit-parallel schedules which can be implemented on a given set of hardware resources. Using this technique, we can avoid generating those schedules which use more resources than are available, and this allows us to generate the desirable schedules in considerably less time. The following theorem, which is stated without proof, is needed so we can construct in a manner that allows us to perform exhaustive bit-parallel scheduling with resource constraints. , and these branches form an elementary directed path which we shall denote as . As described in Section II, we construct the fundamental loop matrix by letting from Algorithm FFL be the th row of . The edges in the graph are numbered such that the first columns of correspond to the branches of the spanning tree of , and the remaining columns correspond to the links. From Theorem 5.1, we know that if there are branches in which are in , then these branches form the elementary directed path . In other words, if contains branches which have not appeared in previous loops, then these branches form a path. These branches are assigned to the next available columns of in the order that they appear in the path . The link is assigned to the -th column of . By constructing the fundamental loop matrix in this manner, it still has the form given in (1); however, it now allows us to use Algorithm IE to determine the schedule values of the nodes directly.
The interval for the scheduling problem is found by enforcing (20) for all such that . Assume that the edge is incident into node and incident from node , i.e.,
. From (5), the expression for the th folded edge weight is . Substituting this into the interval for gives for all such that . Solving for gives for all such that . To avoid confusion with the interval for (recall that we denoted this as ), the interval for is denoted as . This notation specifies that is an interval for the scheduling value of the node that edge is incident into. Let . Then the interval is simply the interval from Algorithm IE with added to the lower and upper bounds. We shall denote this as . Using the technique described in this section for constructing the fundamental loop matrix , Algorithm IE can be used to determine the intervals for the folded edge weights, and the intervals for the scheduling values for the nodes can be found using .

Example 5.3: In this example, all possible scheduling solutions are generated for the DFG in Fig. 14 for an iteration period of 4 by generating the solutions for directly. The computation time for each node is assumed to be unity. Using the technique described in this section for constructing results in (22). Notice that the edge labels in Fig. 14 are different than those used in Fig. 6. The labels have been changed so the column numbers of in (22) correspond to the edge labels in Fig. 14. Using , the intervals are given in Table IV. Note that in this table has been used to simplify the upper bounds of the intervals. The code for this example is shown at the bottom of the next page. The 12 solutions for generated from this code are the same as those listed in Table I.
By determining the values of the schedule vector directly rather than first determining the folding vector and then computing the schedule vector, we can generate only those schedules which can be executed using a limited number of hardware modules. This is done using a programming technique that avoids the solutions which use more resources than are available. For each operation type (e.g., addition or multiplication), an array of data elements is used such that there is one element for each time partition from to . Each data element contains the number of operations of a given type that is currently scheduled to that time partition. Each data element also keeps track of the next time partition in which the hardware resources for that particular operation type are not fully utilized. By keeping track of this information, when we generate a new schedule by incrementing the schedule value for a node, the node is scheduled to a time partition in which the hardware resources for the operation are not already fully utilized. The end result is that we do not generate the schedules that use more resources than are available, so we can generate all scheduling solutions for a given set of resource constraints much more quickly than if we find all possible schedules and keep only those schedules which satisfy the resource constraints.

The advantages of including the resource constraints are demonstrated using the fifth-order wave digital elliptic filter shown in Fig. 15. We assume that addition and multiplication require one and two units of time, respectively, and that hardware adders and multipliers are pipelined by one and two stages, respectively. The results of exhaustively generating the scheduling solutions without considering resource constraints are shown in Table V. The results of exhaustively generating the scheduling solutions which can be implemented on a given number of hardware adders and multipliers are shown on the left side of Table VI.
From these tables, we can see that exhaustively generating only the scheduling solutions which satisfy a given set of resource constraints takes orders of magnitude less time than exhaustively generating all scheduling solutions. The expressions in [27] can be used to compute the number of registers required by a given schedule. The results of this are shown on the right side of Table VI. Note that these results assume that internal pipelining registers cannot be shared between processors, while the results in [27] assume that internal pipelining registers can be shared between processors. The CPU times in Tables V and VI result from generating the scheduling solutions from C code running on a Sun Sparcstation 20.

[Table VI caption (partial): ... solution-generate technique presented in Section V-B. The left part of the table considers scheduling to the minimum possible number of adders and multipliers for the given iteration periods, and the right part considers scheduling to the minimum number of adders, multipliers, and registers for the given iteration periods.]
The solution-generate method can exhaustively generate the scheduling solutions for a given number of multipliers and adders much faster than the solution-save method described in Section V-A. This is demonstrated by scheduling the four-stage pipelined eighth-order all-pole lattice filter in Fig. 13 for the iteration period under the constraints that one adder and one multiplier are available, where the adder is pipelined by one stage and the multiplier is pipelined by two stages. As reported in Example 5.2, there are 10 083 scheduling solutions. Finding these solutions requires 21 CPU seconds using the solution-save technique and only 0.33 CPU seconds using the solution-generate technique. These CPU times result from C code running on a Sun Sparcstation 20.
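The bookkeeping behind the solution-generate method can be sketched as a backtracking search. The Python code below is an illustration only, not the C code used for the reported CPU times. It assumes that the legal range of schedule values for each node is supplied by a hypothetical callback `interval(node, partial)`, which stands in for the intervals obtained from Algorithm IE, and that each operation occupies one unit of its type in the time partition given by its schedule value modulo the iteration period. It keeps one usage array per operation type, as described in this section, but tests each candidate partition directly rather than maintaining a pointer to the next partition that is not fully utilized.

```python
def solution_generate(nodes, op_type, period, available, interval):
    """Enumerate schedules, never extending a partial schedule that
    would overfill a time partition.

    nodes:     list of nodes in the order their values are assigned
    op_type:   dict mapping node -> operation type
    period:    iteration period (number of time partitions)
    available: dict mapping operation type -> number of units
    interval:  hypothetical callback (node, partial schedule) -> (lo, hi),
               the inclusive range of legal schedule values for the node
    """
    # One counter per time partition for each operation type.
    usage = {kind: [0] * period for kind in available}
    partial = {}
    found = []

    def place(index):
        if index == len(nodes):
            found.append(dict(partial))
            return
        node = nodes[index]
        counts = usage[op_type[node]]
        limit = available[op_type[node]]
        lo, hi = interval(node, partial)
        for value in range(lo, hi + 1):
            partition = value % period
            if counts[partition] >= limit:
                continue  # all units of this type are busy in this partition
            counts[partition] += 1
            partial[node] = value
            place(index + 1)
            del partial[node]
            counts[partition] -= 1

    place(0)
    return found


def toy_interval(node, partial):
    """Hypothetical intervals for a three-node example."""
    if node == "a1":
        return (0, 0)  # reference node
    if node == "a2":
        return (0, 3)
    return (partial["a2"], partial["a2"] + 1)  # node "m1"


toy_nodes = ["a1", "a2", "m1"]
toy_types = {"a1": "add", "a2": "add", "m1": "mul"}
constrained = solution_generate(toy_nodes, toy_types, 2, {"add": 1, "mul": 1}, toy_interval)
unconstrained = solution_generate(toy_nodes, toy_types, 2, {"add": 3, "mul": 3}, toy_interval)
```

In the toy example, the constrained search never visits the schedules in which the second addition shares a time partition with the first, so the corresponding subtrees are skipped instead of being generated and then discarded.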
VI. CONCLUSION
Formulations have been presented in this paper for the bit-parallel and bit-serial scheduling problems, and we have shown that the retiming formulation introduced in [22] is a special case of the bit-parallel scheduling formulation. Techniques have been developed and demonstrated for exhaustively generating all unique retiming and scheduling solutions for an SC DFG. These techniques allow a circuit designer to explore the space of possible implementations.
In addition to the technique for exhaustively generating all unique bit-parallel scheduling solutions, the solution-generate technique was also developed for exhaustively generating only the bit-parallel scheduling solutions which satisfy a given set of resource constraints. Our results indicate that the solution-generate technique can generate these schedules in CPU times that are more than two orders of magnitude shorter than those required to generate all solutions.
One advantage of the formulations presented in this paper is that they allow us to understand that retiming and scheduling, which are two problems that have traditionally been viewed as unrelated, are similar. These formulations also show that retiming is an important part of scheduling. Specifically, retiming is shown to be a special case of scheduling, and retiming is included in our scheduling formulations to make them general and to allow the role of retiming to be observed during scheduling. Table V shows some scheduling results for the fifth-order wave digital elliptic filter. Since this filter is often used to demonstrate scheduling techniques, the numbers in these tables provide some benchmarks for gauging the effectiveness of scheduling algorithms. These numbers indicate that the number of schedules increases dramatically as the difference between the iteration period and the iteration bound becomes larger. While generating all schedules is theoretically interesting, for practical applications our exhaustive scheduling techniques are most useful when the iteration period is at or near the iteration bound. The solution-generate algorithm reduces the computational complexity required to generate the solutions which use a given set of functional units; however, the number of schedules and CPU times required to generate these schedules can still be quite large as shown in Table VI for the fifth-order wave digital elliptic filter with iteration period . Future research topics include using the formulations developed in this paper as a basis for scheduling algorithms which use shorter CPU times to find a single optimal schedule based on speed, area, and power constraints.
Throughout this paper, we have focused on SC graphs. However, many DSP algorithms, such as finite impulse response (FIR) filters, have DFGs which are not strongly connected. One way to make these DFGs strongly connected is to add an edge from the output to the input, as was demonstrated for the biquad filter in Section IV-B and the four-stage pipelined eighth-order all-pole lattice filter in Section V-A. Another way to make a graph strongly connected is to add a dummy node and add an edge from each output to the dummy node and an edge from the dummy node to each input. Both of these methods add new edges to the DFG, and the designer can place delays on these edges. In general, adding delays to these edges increases the latency of the DFG and increases the number of delays in some of the loops, which in turn increases the number of retiming and scheduling solutions for the DFG. While this provides a qualitative description of how adding delays to the input and output edges affects the number of retiming and scheduling solutions, a quantitative solution to this problem is beyond the scope of this paper and is a topic of future research.

This paper has focused on scheduling and retiming of DSP algorithms which can be represented using data-flow graphs. Some DSP algorithms can only be represented using control-flow graphs. Using our formulations to develop scheduling techniques for these algorithms is a topic of future research. Another topic of future research is to include more diverse resource constraints, such as interconnection costs, into these scheduling techniques.
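The dummy-node construction described above for making a DFG strongly connected can be sketched as follows. This Python code is an illustration only: the function names, the node label `"dummy"`, and the four-node feed-forward example are hypothetical, edges are represented as plain (source, destination) pairs, and the delays that a designer would place on the new edges are not modeled. The construction yields a strongly connected graph only if every node is reachable from some input and can reach some output, which the sketch checks rather than assumes.

```python
def reachable(adjacency, start):
    """Return the set of nodes reachable from `start` by depth-first search."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for successor in adjacency.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def is_strongly_connected(nodes, edges):
    """True if every node can reach every other node along directed edges."""
    nodes = list(nodes)
    if not nodes:
        return True
    forward, backward = {}, {}
    for source, destination in edges:
        forward.setdefault(source, []).append(destination)
        backward.setdefault(destination, []).append(source)
    root = nodes[0]
    everything = set(nodes)
    # The root must reach all nodes, and all nodes must reach the root.
    return (reachable(forward, root) == everything
            and reachable(backward, root) == everything)


def add_dummy_node(nodes, edges, inputs, outputs, dummy="dummy"):
    """Add an edge from each output to a dummy node and from the dummy
    node to each input; return the new node and edge lists."""
    new_edges = list(edges)
    new_edges += [(output, dummy) for output in outputs]
    new_edges += [(dummy, source) for source in inputs]
    return list(nodes) + [dummy], new_edges


# Hypothetical feed-forward graph with one input and one output.
ff_nodes = ["in", "mul", "add", "out"]
ff_edges = [("in", "mul"), ("mul", "add"), ("in", "add"), ("add", "out")]
sc_nodes, sc_edges = add_dummy_node(ff_nodes, ff_edges, ["in"], ["out"])
```

Here the original graph has no feedback and is therefore not strongly connected, while the augmented graph passes the check because the dummy node closes a loop through every node.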
