Abstract-We propose a transformation-based scheduling algorithm for the following problem: given a loop construct, a target initiation interval, and a set of resource constraints, schedule the loop in a pipelined fashion such that the iteration time of executing an iteration of the loop is minimized. The iteration time is an important quality measure of a data path design because it affects both storage and control costs. Our algorithm first performs an As Soon As Possible Pipelined (ASAPp) scheduling regardless of the resource constraints. It then resolves resource constraint violations by rescheduling some operations. The software system implementing the proposed algorithm, called Theda.Fold, can deal with behavioral loop descriptions that contain chained, multicycle, and/or structural pipelined operations as well as those having data dependencies across iteration boundaries. Experiments on a number of benchmarks are reported.
I. INTRODUCTION
A BEHAVIORAL description specifies the sequence of operations to be performed by the synthesized hardware. Such a description is usually compiled into an internal data representation such as a Data-Flow Graph (DFG) [1], [10], [27]. A DFG captures the data flow dependencies of the given behavioral description. Scheduling algorithms then partition this DFG into subgraphs such that each subgraph can be executed in one control step. Each control step corresponds to one state of the controlling finite-state-machine (FSM). Usually, it takes one clock cycle to perform the operations scheduled into a control step.
Within a control step, a separate functional unit is required to execute each operation scheduled into that control step. Thus, the number of operations in a control step indicates the total number of functional units required in that control step. If more operations are scheduled into each control step, more functional units are needed. This implies that fewer control steps are required to implement the design. On the other hand, if fewer operations are scheduled into each control step, fewer functional units are sufficient but more control steps are needed. Thus, scheduling is an important task in high-level synthesis because it determines the trade-off between hardware cost and performance. A behavioral description is called a straight-line code if it contains neither conditional branches nor loop statements. Extensive research has been conducted for automatically scheduling straight-line codes subject to time constraints (e.g., [11], [21], [23]), resource constraints (e.g., [9]), or both (e.g., [18], [31]).
Manuscript received April 15, 1992; revised September 23, 1992. This work was supported in part by the National Science Council of R.O.C. under Contract NSC-80-0404-E-007-20 and under Contract NSC-81-0404-E-007-115, and by the NSF under Grant MIP-8922851-01. This paper was recommended by Associate Editor R. Camposano. IEEE Log Number 9213663.
A behavioral description often contains loop statements, or the description itself may be the body of an infinite loop. For example, a filter in many digital signal processing applications repeatedly executes the same set of operations on every sample of the input data stream as an infinite loop. Because the loop execution dominates the total execution time, optimizing the loop body is essential to the performance of a design.
However, scheduling a loop body is quite different from scheduling a straight-line code because of parallelism beyond the iteration boundaries. We use Fig. 1 to illustrate three different loop scheduling methods: sequential, unrolling, and loop folding. Assume that the loop consists of 12 iterations. In a sequential scheduling approach, an iteration is scheduled into C control steps. The total execution time is $12 \times C$ clock cycles, as shown in Fig. 1(a).
The unrolling method unrolls several loop iterations to form a larger loop body before scheduling. Because a larger loop body provides a greater opportunity for compacting the schedule, this technique usually produces a shorter schedule. The total execution time is therefore less than that required in a sequential schedule.
The loop folding method overlaps the execution of successive iterations in a pipelined fashion (1). An iteration may require IT > C clock cycles to finish, where IT is the iteration time. However, successive iterations are initiated every II clock cycles, where II is the initiation interval. Consequently, the total execution time is $IT + (12 - 1) \times II$ clock cycles.
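The two execution-time expressions above can be compared with a short Python sketch. It is only an illustration: the function names and the numeric values used below (C = 4, IT = 6, II = 2) are assumptions chosen for the example, not figures taken from a specific schedule in this paper.

```python
def sequential_time(n_iter, c):
    """Total clock cycles when every iteration takes c control steps
    and iterations do not overlap."""
    return n_iter * c


def folded_time(n_iter, it, ii):
    """Total clock cycles for a folded loop: the first iteration needs
    `it` cycles and every later iteration starts `ii` cycles after the
    previous one."""
    return it + (n_iter - 1) * ii


if __name__ == "__main__":
    n = 12  # number of iterations, as in the running example
    print("sequential:", sequential_time(n, 4))
    print("folded:    ", folded_time(n, 6, 2))
```

With these assumed values, the folded loop finishes in fewer cycles even though a single iteration takes longer, which is the effect the folding method relies on.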
We use a second-order IIR filter example [16] shown in Fig. 2 to further illustrate the three loop scheduling methods. Fig. 2(b) and (c) are two graphs capturing the loop behavior described in Fig. 2(a). The detail of the graph model will be presented in Section III.
A sequential schedule with four control steps using two adders and two multipliers is shown in Fig. 3(a). It is optimal because the length of the critical path (*a → +e → +g → +h) is four clock cycles, and a data path needs at least two adders and two multipliers to carry out all eight operations in four control steps due to data dependencies. The control cost is four words, assuming that one word is needed for each scheduled control step in the control unit. Fig. 3(b) shows an optimal schedule of the example after unrolling two iterations. In this schedule, two iterations consisting of eight additions and eight multiplications are scheduled into six control steps using two adders and two multipliers. The average time required to execute one iteration is reduced from 4 to 6/2 = 3 clock cycles, while the control cost increases from four to six control words. Fig. 3(c) shows a loop folding schedule of the same example using the same amount of hardware (i.e., two adders and two multipliers). Successive iterations begin their execution every two clock cycles. The average time for executing an iteration is approximately two clock cycles when the number of iterations is large. The control cost is six control words (two control words each for the head, body, and tail of the loop).
This example demonstrates that loop folding is indeed an effective technique for exploiting parallelism beyond iteration boundaries.
In this paper, we propose a transformation-based algorithm for loop folding. Given a loop behavior, a target initiation interval, and a set of resource constraints, we schedule the loop in a pipelined fashion such that the initiation interval is achieved, the resource constraints are met, and the iteration time is minimized. Our algorithm consists of two phases: as soon as possible pipelined scheduling and rescheduling. During the first phase, we ignore all resource constraints. During the second phase, we resolve violations against the resource constraints by rescheduling some of the operations. A priority function is proposed to aid the rescheduler in operation selection.
The rest of this paper is organized as follows. In Section II we briefly survey some previous work. In Section III we present a graph model for capturing the behavior of a loop. An overview of the proposed algorithm is given in Section IV. The two phases of the proposed algorithm are described in Sections V and VI, respectively. Section VII reports some experimental results. Finally, we conclude this paper with a summary and some pointers to possible directions for future research in Section VIII.
II. PREVIOUS WORK
In the high-level synthesis community, several papers related to this work have been published over the past decade. Reference [ In the PLS system [17], a heuristic is added to list-based scheduling. It schedules operations into a state of an iteration at a time using a list-based algorithm. However, before starting a new iteration, it performs backward scheduling on the already scheduled operations (as late as possible within each iteration) in order to make more functional units available in the earlier states. This leads to a reduction in iteration time for most of the benchmarks. PLS can schedule loops with inter-iteration data dependencies.
ATOMICS [13] folds the loop iteratively starting with an inferior (long iteration time) schedule. In successive "folding iteration steps," certain operations are first folded in a heuristic way. The resulting DFG is then scheduled using a constraint projection and scheduling technique. Based on an evaluation of the resultant schedule, the proposed folding may be adjusted in a later iteration step in order to further reduce the iteration time. This iterative process terminates when a target iteration time cannot be met after a large number of tries.
The rotation scheduling [2] is another transformation-based approach for loop pipelining. It repeatedly up- or down-rotates a schedule to get a more compact one under a set of resource constraints. An up-rotation (down-rotation) moves all operations in the last (first) control step to the step before (after) the first (last) step. A rotation is followed by some adjustment similar to percolation, to be described next.
PBS (Percolation Based Synthesis) [32] is a technique originally developed for optimizing compilers. It folds a loop in two phases: optimal scheduling, and delaying-and-percolating operations. In the first phase, the loop is incrementally unwound. As new copies of DFG's are brought in, operations are allowed to migrate upwards without regard to either iteration boundaries or resource constraints. This migration is only constrained by the data dependencies. This unwinding process terminates when a repeating pattern emerges in the unwound DFG. The length of the repeating pattern implies the initiation interval. In the second phase, the algorithm deals with states that violate the resource constraint. Mobility [28] is employed as the criterion for delaying operations. An operation is delayed either to the next violation-free state or to a new state inserted after the current state. If a new state is inserted, some operations may be percolated upward from the later states. This process continues until no more resource constraint violation exists.
Each of the above approaches has its own weakness. The ILP approach is very time-consuming. Therefore, it cannot be used for large designs. The list-based approach schedules operations into an early iteration without the knowledge of the unscheduled portion of the DFG. PLS improves the list-based approach based on the assumption that operations tend to go to the earlier states of every iteration. However, this is not always the case. ATOMICS's scheduling results are very sensitive to the initial schedule. PBS has no control over the loop overhead (i.e., control cost). Moreover, it pays no attention to the iteration time. Thus, it often produces data paths with excessive storage elements and a large control unit.
III. A GRAPH MODEL
Like most previous work, we use a dependency-graph (DG) data structure to capture the behavior of a loop. A DG is derived from a data-flow graph (DFG) which is commonly used by the DSP community to describe a signal processing, especially filtering, algorithm. In a DFG, a net connects one source node and one or more sink nodes, while in a DG each net connects a pair of source and sink nodes. Fig. 2 (b) depicts a DFG representing the second-order IIR filter algorithm described in Fig. 2(a) . Formally, it is a vertex-
is the node set and N the net set. Each node in C represents a computational node, in D a delay element, in I a primary input, and in O a primary output. In Fig. 2
and O = {Y}. Each net in N has the format $s \to (t_1, t_2, \ldots)$, where s is the source and $t_1, t_2, \ldots$ are the sinks of the net. In
A DG is simpler than a DFG. It keeps only the information that is essential to loop folding. It is a directed and edge-weighted graph, DG(C, E), where C is the same as that of the DFG and E the edge set. Each edge in E represents a path in the DFG from a computational node to another through zero or more delay elements. The number of delay elements along the path is used to weight the edge. For example, Fig. 2(c) depicts the DG derived from the DFG shown in Fig. 2(b).
Each edge in the DG represents a data dependency of degree $\delta$, where $\delta$ is the edge weight. It indicates that the data produced by the source computational node will be used by the sink computational node $\delta$ loop iterations later. In other words, a dependency relation $Op_1 \xrightarrow{\delta} Op_2$ in a DG represents that the data item produced by operation node $Op_1$ in a particular iteration will be consumed by operation node $Op_2$ $\delta$ iterations later. On the other hand, if the degree of a data dependency equals zero, the data item will be produced and consumed in the same iteration. Obviously, the degree of each data dependency is greater than or equal to zero.
In a loop folding schedule, such as the one depicted in
With the folded-DG representation, we imply that the execution sequences (patterns) of the operations are identical for all iterations. This is consistent with all previous work except PBS [32], which may schedule different iterations differently.
IV. ALGORITHM OVERVIEW
The proposed algorithm is transformation-based. Its inputs include: a loop behavior represented as a DG, a target initiation interval, and a set of resource constraints. Its output is a feasible schedule of the DG. Its goal is to minimize the iteration time of the schedule. A schedule is feasible if it satisfies both the data dependencies and the resource constraints.
The proposed algorithm consists of two phases:
1. As Soon As Possible Pipelined Scheduling (ASAPp): This phase takes into account the data dependencies and the target initiation interval but ignores the resource constraints. The result is a legal schedule with the shortest possible iteration time under the initiation interval constraint. A schedule is legal if it satisfies all data dependencies.
2. Resolving Resource Constraint Violations: In a legal schedule, there may exist violations against the resource constraints. Whenever a resource constraint violation occurs in a control step, some operations in that control step need to be rescheduled. The selection of candidates for rescheduling is based on a priority function to be described in Section 6.2. This process iterates until no more resource constraint violation exists (i.e., the schedule becomes feasible).
[Fig. 5. Algorithm FoldScheduling (pseudo code).]
We use a deterministic polynomial time algorithm for the first phase and a heuristic iterative one for the second. In the second phase, the algorithm makes several passes (the repeat statement) over a legal schedule ($S_{DG}$). It first calls the subroutine ALAPp to compute the latest possible state into which each operation can be scheduled under the given target initiation interval II and the iteration time IT.
Initially, IT is equal to the iteration time of the ASAPp schedule. The subroutine ViolationResolving returns a feasible schedule if one has been found. Otherwise, it returns the "NULL" value. If no feasible schedule is found during a pass, the algorithm increases IT by one and restarts the second phase.
Since the resource constraints may be inadequate for achieving the desired performance level (given via the initiation interval), it is possible that the second phase will never terminate. To prevent the algorithm from pursuing an impossible goal, we set a bound on the execution time by calling the Time-out function after each pass.
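The control flow described in this section can be sketched as a small Python driver. This is a hedged illustration rather than the authors' pseudo code: the three phase routines are passed in as callables with hypothetical signatures, and the wall-clock limit `time_limit_s` merely stands in for the Time-out function mentioned above.

```python
import time


def fold_scheduling(dg, ii, rc, asapp, alapp, violation_resolving,
                    time_limit_s=1.0):
    """Two-phase driver (sketch).

    asapp(dg, ii)                      -> (legal schedule, iteration time)
    alapp(dg, ii, it)                  -> latest allowed state per operation
    violation_resolving(sdg, latest, ii, it, rc)
                                       -> feasible schedule, or None
    All three signatures are assumptions made for this example.
    """
    sdg, it = asapp(dg, ii)                  # phase 1: ignore resources
    deadline = time.monotonic() + time_limit_s
    while time.monotonic() < deadline:       # stand-in for the Time-out check
        latest = alapp(dg, ii, it)
        result = violation_resolving(sdg, latest, ii, it, rc)
        if result is not None:
            return result, it                # feasible schedule found
        it += 1                              # relax IT and restart phase 2
    return None, it                          # gave up within the time bound
```

Each pass restarts from the same legal schedule and only the iteration time grows, which mirrors the statement that new states are added at the end of the schedule.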
Although our two-phase approach is similar to that of PBS [32], the proposed algorithm has several unique features:
It needs only one copy of the DG during loop folding. It allows the user to specify the initiation interval. It does not have to detect the repeating pattern. It inserts new states only after the last state. It uses a more sophisticated priority function to select candidate operations for rescheduling.
In the next two sections, we describe procedures ASAPp and ViolationResolving, respectively.
V. THE AS SOON AS POSSIBLE PIPELINED SCHEDULING
Given unlimited resources and a target initiation interval, this procedure schedules the loop as soon as possible in a pipelined fashion. To obtain a legal schedule, we first introduce the following lemma from
Lemma 1: A data dependency $Op \xrightarrow{\delta} Op'$ is satisfied if and only if $A_{Op'} \ge A_{Op} + EXE_{Op} - \delta \times II$, where $A_{Op}$ ($A_{Op'}$) is the state into which $Op$ ($Op'$) has been scheduled, $EXE_{Op}$ the execution delay of $Op$, and II the target initiation interval. Note that $A_{Op} = j + (i - 1) \times II$ if $Op$ is scheduled into the jth step of the ith fold.
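The relation between a state number and its (fold, step) position can be sketched as follows. This is a minimal illustration, assuming that steps j are counted from zero and folds i from one, which is what the state numbers quoted in the illustrative example of this section suggest; the function names are hypothetical.

```python
def state_of(j, i, ii):
    """State number of an operation in step j (0-based) of fold i (1-based)."""
    return j + (i - 1) * ii


def fold_and_step(state, ii):
    """Inverse mapping: return (fold i, step j) for a given state number."""
    q, j = divmod(state, ii)
    return q + 1, j
```

For instance, with an initiation interval of two, the first step of the second fold is state 2 under this convention.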
A DG with no dependency cycle can be scheduled to achieve any initiation interval (II) no smaller than one control step. When II = 1, all operations are scheduled to the same control step and one functional unit is needed for each operation (i.e., a maximally parallel design). For a DG with dependency cycles, there is a bound on the achievable initiation interval. For example, Fig. 6 shows a dependency cycle of a DG. It has three degree-zero data dependencies (A → B, B → C, and C → D) and one degree-one data dependency (D → A). Suppose each operation takes one control step to execute. We cannot obtain any legal schedule for the DG under a target initiation interval II = 1, 2, or 3 control steps. This phenomenon was first observed in [33].
. . . $Op_n\}$ and the data dependencies be $Op_1 \to Op_2$, $Op_2 \to Op_3$, . . . , $Op_{n-1} \to Op_n$,
¹It may require extra registers during data path allocation if an operation is scheduled out of iteration [13].
Lemma 3: If there exists no dependency cycle C in the DG such that $|C| > \delta_C \times II$, there exists a legal schedule.
Proof: We prove this lemma by presenting an algorithm to construct a legal schedule. First, all operations are scheduled into the first control step. Then we legalize the schedule by relying on the following rule:
If a data dependency $Op \xrightarrow{\delta} Op'$ violates Lemma 1, we delay $Op'$ as little as possible to satisfy Lemma 1. In other words, if there exists a data dependency $Op \xrightarrow{\delta} Op'$ such that $A_{Op'} < A_{Op} + EXE_{Op} - \delta \times II$, let $A_{Op'} = A_{Op} + EXE_{Op} - \delta \times II$.
We apply this rule repeatedly until no more data dependency violation exists. Clearly, if the procedure halts, the produced schedule is legal. The only situation preventing the procedure from terminating is when all but one data dependency in a dependency cycle are satisfied and the resolution of the violation will cause a new violation. Let the cycle be $C = \{Op_1, Op_2, \ldots, Op_n\}$ and the dependencies be $Op_1 \to Op_2$, $Op_2 \to Op_3$, . . . , $Op_{n-1} \to Op_n$,
Theorem 1: Given a DG and a target initiation interval II, the ASAPp algorithm can find a legal schedule with the minimum iteration time if, and only if, the target initiation interval II is greater than or equal to the initiation bound of the DG.
Proof: (⇒) From Lemma 2 we learn that if the target initiation interval is less than the initiation bound, there exists no legal schedule even with unlimited resources. In other words, if a scheduling algorithm can find a legal schedule, the target initiation interval must be greater than or equal to the initiation bound.
(⇐) By the proof of Lemma 3, if the target initiation interval is not less than the initiation bound, the ASAPp scheduling can find a legal schedule. Moreover, according to Lemma 1, the ASAPp scheduling produces a schedule with the minimum iteration time. □
there exists no cycle C such that $|C| > \delta_C \times II$).
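The cycle condition can be turned into a bound on the initiation interval with a few lines of Python. This is a sketch under stated assumptions: $|C|$ is taken to be the sum of the execution delays of the operations on the cycle and $\delta_C$ the sum of the degrees of its edges, and the cycles are supplied explicitly by the caller. The function names are hypothetical.

```python
def cycle_bound(exe_delays, degrees):
    """Smallest II for which one cycle satisfies |C| <= delta_C * II.

    exe_delays: execution delays of the operations on the cycle
    degrees:    degrees of the edges on the cycle
    """
    length, delta = sum(exe_delays), sum(degrees)
    if delta == 0:
        raise ValueError("a cycle of total degree zero cannot be scheduled")
    return -(-length // delta)  # integer ceiling of length / delta


def initiation_bound(cycles):
    """Largest per-cycle bound; 1 when the DG has no dependency cycle."""
    return max((cycle_bound(e, d) for e, d in cycles), default=1)
```

For the cycle of Fig. 6 (four one-step operations, one degree-one edge) this gives a bound of four control steps, consistent with the observation that II = 1, 2, or 3 admits no legal schedule.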
An Illustrative Example
We use Fig. 8 to illustrate how ASAPp works on the second-order IIR filter example when the target initiation interval is two control steps. Initially, ASAPp puts all operations into the first step of the first fold (state 0), as depicted in Fig. 8(a). It then finds, among others, that the data dependency constraint between operations *a and +e is not satisfied. This violation is resolved by moving operation +e to the second step of the first fold, as depicted in Fig. 8(b). After rescheduling operation +e, we find a degree-zero data dependency violation between operations +e and +g, as shown in Fig. 8(b). Resolving this violation requires rescheduling operation +g into a state no sooner than one state ($EXE_{+e} - \delta \times II = 1 - 0 \times 2 = 1$) after that of operation +e. Delaying operation +g to the first step of the second fold (state 2) satisfies this dependency constraint (Fig. 8(c)). However, this leads to two degree-one dependency violations (i.e., operations +g to *b and +g to *c) and one degree-zero dependency violation (i.e., operations +g to +h). The violations are resolved by rescheduling operations *b and *c into the second step of the first fold (state 1) and +h into the second step of the second fold (state 3), as depicted in Fig. 8(d). Finally, operation +f is rescheduled into the first step of the second fold (state 2) to resolve the degree-zero data dependency violation between it and operation *c. The final legal schedule is shown in Fig. 8(e).
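The walk-through above can be reproduced with a small relaxation loop in Python. This is a sketch, not the authors' implementation: the edge list in the usage example contains only the dependencies named in the text (the complete DG of Fig. 2(c) may contain more), and `max_state` is a hypothetical guard that stops the loop when the target initiation interval is below the initiation bound.

```python
def asapp(ops, edges, exe, ii, max_state=1000):
    """Relaxation form of ASAPp.

    ops:   iterable of operation names
    edges: (src, dst, degree) triples
    exe:   execution delay of each operation
    Returns {op: state}, or None if a state would exceed max_state.
    """
    state = {op: 0 for op in ops}        # start everything in state 0
    changed = True
    while changed:
        changed = False
        for src, dst, deg in edges:
            earliest = state[src] + exe[src] - deg * ii
            if state[dst] < earliest:    # violated: delay dst minimally
                if earliest > max_state:
                    return None
                state[dst] = earliest
                changed = True
    return state


if __name__ == "__main__":
    ops = ["*a", "*b", "*c", "*d", "+e", "+f", "+g", "+h"]
    edges = [("*a", "+e", 0), ("+e", "+g", 0), ("+g", "*b", 1),
             ("+g", "*c", 1), ("+g", "+h", 0), ("*c", "+f", 0)]
    print(asapp(ops, edges, {op: 1 for op in ops}, ii=2))
```

Operations are only ever moved to later states, so the loop either reaches a fixed point or grows without bound, which is why the guard is needed in this sketch.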
Time Complexity Analysis
Let i be the iteration time, n the number of operations, and m the number of data dependencies. To find a data dependency violation takes at most m examinations of the graph edges. Since we only move the operations downwards, each operation will be moved at most i times. Therefore, the time complexity of the ASAPp scheduling algorithm is $O(i \times n \times m)$.
VI. RESOLVING RESOURCE CONSTRAINT VIOLATION
The schedule obtained from the ASAPp phase is legal and has the shortest possible iteration time under the target initiation interval. However, it may be infeasible because the resource constraints are totally ignored during the ASAPp scheduling process. If there exists no resource constraint violation, the loop folding task is done. On the other hand, if there is any state that contains more operations than the number of available functional units, the schedule has to be modified. We now face the problem of how to select the operations for rescheduling and how to actually carry out the rescheduling. In this section, we first propose a rescheduling algorithm for resolving resource-constraint violations, then we present a new priority function used by the rescheduler to select the rescheduling candidates.
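Detecting the violating control steps can be sketched in Python as follows. The sketch assumes that every operation occupies one functional unit of its type for exactly one control step and that state s folds onto control step s mod II; multicycle and chained operations are outside its scope. The function name and the type labels are hypothetical.

```python
from collections import Counter


def find_violations(state, op_type, ii, rc):
    """Return {(control_step, type): usage} for every control step whose
    usage of a unit type exceeds the number of units allowed by rc."""
    usage = Counter((s % ii, op_type[op]) for op, s in state.items())
    return {key: n for key, n in usage.items() if n > rc[key[1]]}


if __name__ == "__main__":
    state = {"*a": 0, "*b": 1, "*c": 1, "*d": 0,
             "+e": 1, "+f": 2, "+g": 2, "+h": 3}
    op_type = {op: ("mul" if op[0] == "*" else "add") for op in state}
    print(find_violations(state, op_type, 2, {"mul": 1, "add": 2}))
```

An empty result means the legal schedule is already feasible and no rescheduling is required.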
The Rescheduling Algorithm
The inputs to the proposed algorithm include a legal schedule produced by the ASAPp under a target initiation interval, an iteration time given by the main algorithm, FoldScheduling (Fig. 5) , and the resource constraint. The algorithm returns a feasible schedule if one has been found. Otherwise, it returns the "NULL" value.
A pseudo code description of the rescheduling algorithm is given in Fig. 9. $S_{DG}$ is a legal schedule, II the target initiation interval, IT the given iteration time, and RC the set of resource constraints. In each pass of the while loop, if there exist one or more control steps in $S_{DG}$ violating the resource constraint, we select the first one (control step S) for resolving the violation. Subroutine $HC(S, t_{op})$ returns the cost of the hardware required to implement the operations of type $t_{op}$ that have been scheduled into control step S. Delaying an operation with zero dynamic mobility will lengthen the iteration time, i.e., violate the given iteration time. Therefore, a candidate operation in U must have a positive dynamic mobility value. Note that the look-ahead schedule $S_{DG}^{Op}$ is computed by delaying Op one state and successively invoking subroutine PushDown($S_{DG}$, Op) (see Section V) to satisfy the data dependencies. If a feasible schedule has been found, the algorithm returns it. Otherwise, the "NULL" value is returned to the FoldScheduling algorithm (see Section IV). The latter case occurs when the temporary variable U is equal to zero. In other words, there exists a control step S which violates the resource constraint, but the dynamic mobility value of each candidate operation within S is zero.
The Priority Function
The violation resolving phase relies on a priority function to select operations for rescheduling. The priority function has a profound effect on not only the scheduling quality but also the computational efficiency. We propose a global priority function called total difference. The priority function is calculated in three steps. In the first step, we construct a look-ahead schedule for each candidate operation. In the second step, we score each look-ahead schedule. Finally, we rank the candidate operations according to the scores. The detailed scoring scheme is described as follows.
Definition 2: Given a schedule $S_{DG}$ and a set of resource constraints $RC_{t_j}$'s, the total difference of a look-ahead schedule $S_{DG}^{Op}$, obtained by delaying Op as little as possible, is defined as:
where $AV_{i,t_j}$ denotes the average hardware cost required to implement operations of type $t_j$ scheduled into control step i in the look-ahead schedule $S_{DG}^{Op}$. The average hardware cost $AV_{i,t_j}$ is calculated as:
$$AV_{i,t_j} = \sum_{Op \text{ is of type } t_j} Prob(Op, i) \times HC_{Op},$$
where $Prob(Op, i)$ is the probability of scheduling Op into control step i and $HC_{Op}$ the cost of a functional unit for
In order to illustrate the priority function, let us return to the second-order IIR filter example shown in Fig. 2. Assume that an addition (a multiplication) is executed on an adder (a multiplier) in one clock cycle. In addition, the cost of an adder and that of a multiplier are one unit each. Fig. 10(a) shows the result of the ASAPp scheduling with a target initiation interval of four control steps. Fig. 10(b) depicts the ALAPp schedule with a given iteration time of six clock cycles. Assume that the resource constraint is one adder and one multiplier; then it is clear that control step 0 violates the resource constraint. To resolve this resource constraint violation, at least one of the four multiplications (*a, *b, *c, and *d) has to be rescheduled into later states. Consequently, we pick operations *a, *b, *c, and *d as the candidate operations for rescheduling. Next, for each candidate operation, we construct a look-ahead schedule by delaying it as little as possible according to data dependencies. Fig. 10(c)-(f) show the look-ahead schedules $S_{DG}^{*a}$, $S_{DG}^{*b}$, $S_{DG}^{*c}$, and $S_{DG}^{*d}$, respectively.
By the look-ahead schedule and the ALAPp schedule, we can obtain a time frame [31] for each operation. For simplicity, we assume that an operation has the same probability of being scheduled into any state in its time frame. For example, by Fig. 10(b) and (c), we know that *a can be rescheduled into either state 1 or state 2. Hence we set the probability of scheduling *a into either state to 1/2. Table I shows the probability distribution function on states. Since states 0, 1, 2, and 3 and states 4 and 5 are overlapped in a pipelined fashion into control steps 0, 1, 2, and 3, respectively, the probability distribution function on control steps is shown in Table II.
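The uniform distribution over a time frame and its folding onto control steps can be sketched in Python. The sketch assumes that a time frame is the closed range of states between the look-ahead state and the ALAPp state of an operation, and that state s maps to control step s mod II; the function name is hypothetical and exact fractions are used to avoid rounding.

```python
from collections import defaultdict
from fractions import Fraction


def step_probabilities(earliest, latest, ii):
    """Spread probability uniformly over states earliest..latest and
    accumulate it per control step (state modulo ii)."""
    n_states = latest - earliest + 1
    prob = defaultdict(Fraction)          # missing keys start at 0
    for s in range(earliest, latest + 1):
        prob[s % ii] += Fraction(1, n_states)
    return dict(prob)


if __name__ == "__main__":
    # An operation that may go into state 1 or state 2, with II = 4.
    print(step_probabilities(1, 2, 4))
```

When a time frame is longer than II, several states fold onto the same control step and their probabilities add up.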
Based on the cost assumption and the priority function shown in Table 11 , we can compute the average hardware cost Similarily , oiff(s2G) = i,
Diff(S;;d,) = 4.
Since the smaller the total difference is, the better the corresponding candidate operation is, we select *b for rescheduling first. Fig. 11 depicts the final feasible schedule after all violations have been resolved.
Time Complexity Analysis
We now analyze the time complexity of the violation resolving algorithm. Let n be the number of operations, m the number of data dependencies, and i the iteration time. Since we only move the operations downwards, each operation will be moved at most i times, the algorithm will execute at most 
VII. EXPERIMENTS
We have implemented the proposed algorithm in a C program running on a SUN-4/490 workstation. Allowing constraint on the number of buses. The experiment consists of two parts. First, we compare Theda.Fold with several published approaches. Then, we report results on some benchmarks that have not been used before in the open literature.
Comparisons with Previous Work
This section reports the results on the 16-point FIR filter [30], the fifth-order elliptic filter [22], the FDCT benchmark [26], and Cytron's DG [3]. Theda.Fold is compared with RS [2], ATOMICS [13], Spaid [14], [15], PLS [17], Sehwa [30], HAL [31], and PBS [32].
The 16-point FIR Filter Benchmark: The 16-point digital FIR filter benchmark is found in [30]. To ensure a fair comparison, we follow the assumptions below made by previous research: an addition takes 40 ns to perform, a multiplication takes 80 ns to perform, a latch delay is 20 ns, the clock cycle is 100 ns, and operation chaining is allowed. HAL is the force-directed pipeline scheduler [31]. In terms of the initiation interval, Theda.Fold achieves the optimum schedule for all cases because every II is equal to the theoretical bound determined by the number of operations (eight multiplications and 15 additions) and the resource constraints. In terms of the iteration time, Theda.Fold is better than SehwaB in the only case reported previously.
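A resource-determined lower bound on the initiation interval of the kind mentioned above can be sketched in Python. This is an assumed simplification: every operation is taken to occupy one functional unit of its type for one control step, so a type with N operations and R units needs at least ceil(N / R) control steps. The unit counts in the usage example are hypothetical and are not taken from the tables of this paper.

```python
def resource_bound(op_counts, units):
    """Lower bound on II imposed by the functional units alone.

    op_counts: number of operations per type in one iteration
    units:     number of available functional units per type
    """
    return max(-(-op_counts[t] // units[t]) for t in op_counts)


if __name__ == "__main__":
    fir_ops = {"mul": 8, "add": 15}   # operation counts of the FIR benchmark
    print(resource_bound(fir_ops, {"mul": 2, "add": 4}))
```

The achievable initiation interval is also limited by dependency cycles, so the overall bound is the larger of the two.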
The fifth-order elliptic filter benchmark [22] consists of 26 additions and eight multiplications. Like previous work, we assume that each addition takes one clock cycle (cc) while each multiplication takes two to execute. Two sets of experiments have been performed: one without bus constraint and the other with bus constraint. Moreover, nonpipelined and pipelined multipliers (denoted by * and *p, respectively, in the tables to be described) are evaluated separately. When the number of buses is not limited, we have compared Theda.Fold with HAL [31] using both the Force-Directed Scheduling (FDS) and the Force-Directed List Scheduling (FDLS), the Percolation-Based Scheduling (PBS) [32], and the Rotation Scheduling (RS) [2]. Table IV shows that Theda.Fold is able to achieve a higher level of performance given the same set of functional units.
When the number of buses is limited, we have compared Theda.Fold with Spaid [14], [15] and PLS [17]. Table V
- 5   -10  3  3  3  14  6  6  11  11  2  2  2  12  8  8  12  12  2  2  2  10  10  8  14  13  2  2  2  8  12  11  12  12  1  1  2  8  13  13  16
that the algorithm is to be performed on many sets of data. Therefore, we treat the DFG as the body of a loop. Note that there is no inter-iteration data dependency.
The only work that reports on this benchmark is PLS. Table VI shows that Theda.Fold achieves the same level of throughput rate for various combinations of functional unit constraints but requires fewer buses than PLS. In several cases, Theda.Fold is better in not only the initiation interval but also the iteration time.
7.1.4 Cytron's DG Example: This benchmark [3] has been commonly used in optimizing compiler research. All operations are assumed to be executed on an ALU in one clock cycle.
In the DG, the length of the longest path is six and that of the longest cycle is three. Therefore, there is a lower bound of six clock cycles on the iteration time and three clock cycles on the initiation interval. Its pipelined schedule has been previously reported by ATOMICS [13], PBS [32], and PLS [17]. The results in Table VII show that Theda.Fold achieves an iteration time equal to the lower bound (six clock cycles) in all cases, while every previous approach produced suboptimal designs in at least one case.
Other Benchmarks
This section presents the experimental results on three benchmark examples that have not been reported before. They are the second-order IIR filter [16], the third-order IIR filter [19], and the IDCT benchmark [7], [35]. In the first two benchmarks, each addition is executed on an adder in one clock cycle, while each multiplication is executed on a nonpipelined multiplier (*) in two clock cycles or on a two-stage structural pipelined multiplier (*p). In the IDCT benchmark, each addition/subtraction is executed on an ALU in one clock cycle, and each multiplication is executed on a nonpipelined multiplier in two clock cycles.
Since there is no previous result for comparison, we just report the scheduling results in Tables VIII-X.
VIII. CONCLUSIONS
We have proposed a two-phase transformation-based scheduling algorithm for loop folding. The first phase produces a legal schedule using an as soon as possible (pipelined) algorithm. The second phase transforms the legal schedule into a feasible one by selectively rescheduling operations. We have also designed a new priority function for selecting candidate operations during rescheduling. The proposed algorithm has been implemented in a C program named Theda.Fold. Experimental results over a set of benchmarks have demonstrated that Theda.Fold is more effective than previous approaches. In most cases, we are able to either achieve a higher level of performance using the same amount of hardware or use less hardware to achieve the same level of performance. Therefore, Theda.Fold is useful to the designer in exploiting the cost-performance tradeoff of a design.
The proposed algorithm is also applicable to the design of optimizing compilers for computers based on the Very Long Instruction Word (VLIW) architecture. Incorporating register allocation and memory synthesis techniques, we are using Theda.Fold to help generate the microcode program of a DSP processor.
Rescheduling an operation across the iteration boundary can be viewed as retiming [25] . Retiming has been used in sequential logic synthesis to minimize the clock cycle length. Therefore, it is possible to apply a loop folding technique to sequential logic retiming or vice versa.
