Linear dierence equations involving recurrences are fundamental equations that describe many important signal processes, in particular, innite-duration impulse response (IIR) lters. Applying conventional dependence-preserving parallelization techniques such as software pipelining can only extract limited parallelism due to loop-carried dependences in the linear recurrences, and thus cannot achieve scalable speedup given more resources. Furthermore, the previously published scheduling techniques do not have the capability for exploring design space of parallel schedules implementing IIR lters. In this paper, we present a novel approach, based Harmonic Scheduling, that explores design space of scalable parallel schedules implementing IIR lters with resource constraints. The salient features of our approach include a mathematical formulation of the relationship between the schedules, resource constraints and target performance, and capabilities for exploring design space in terms of those parameters. In particular, our approach can be used to successively approximate time-optimal schedules implementing IIR lters for a given target architecture. We illustrate our approach by giving an algorithm for deriving scalable schedules for IIR lters with a xed number of identical multi-functional processors. As a further illustration, we derive rate-optimal schedules for IIR lters under more realistic constraints: using a xed number of adders and multipliers and assuming that multiplication and addition take dissimilar execution times.
Introduction
Many important digital ltering processes are described using linear dierence equations that express the lter's current response y(nT ) as a function of past responses: 
y This is an expanded version of the paper published in Proceedings of the First European Design Automation Conference (EuroDac'92), Hamburg, Germany, Sept. 1992. 3 This work was supported in part by NSF grants CCR8704367, MIP9009239 and ONR grant N0001486K0215.
scalable parallel schedules with a xed number of identical multi-functional processors for second-order IIR lter sections using Harmonic Scheduling. Section 6 derive rate-optimal parallel schedules for second-order IIR lter sections using our scheduling technique. with more realistic resource constraints: the target architecture consisting of a xed number of adders and multipliers and assuming a multiplication takes twice as much time as an addition. Section 7 discuss the signicance and implications of our techniques and future work.
Related Work
Related work has been done in two areas: scheduling algorithms for high-level synthesis and parallel computing. Scheduling algorithms for high-level synthesis have been exploiting three forms of parallelism: concurrency, functional pipelining and loop pipelining. Concurrency [3] explores instructions that may be executed simultaneously. Functional pipelining techniques overlap instructions along the data paths [3] . Loop pipelining techniques transform a sequential loop into a loop with parallelism across multiple iterations extracted while preserving the program's semantics. Since scheduling with resource constraints in general is NP-complete, heuristic-based loop pipelining techniques [5, 19, 7] have been developed to compact loops with given resource constraints.
Percolation-based loop pipelining techniques [18] rst compact a loop into its optimal parallel counterpart and then apply resource constraints on the parallel version. Static scheduling via optimum unfolding [16] in DSP respects the inherent lower bound on the achievable iteration period imposed by the recursions in the program, thus is essentially equivalent to aggressive software pipelining on a class of DSP computations without resource constraints in terms of resulting parallelized programs. All of the techniques above do not try to reduce the critical path by exploiting parallelism beyond source-level LCD's, and hence cannot generate scalable parallel computation implementing IIR lters as the number of computing units increases. Techniques that extract parallelism beyond LCD's have received attention in the high-level synthesis and digital signal processing communities. A few techniques based on tree height reduction [6, 14, 13] have been proposed to reduce, using extra operations, the length of critical paths of computations consisting of a sequence of additions. However, these techniques are unable to generate scalable schedules for computations with diering types of operations intermixed, e.g., linear recurrences. Scalable scheduling methods for computation involving linear recurrences as shown in Equation 1 with resource contraints have been studied fairly recently. Successive approximation of optimal schedules for recursive lter computation described with linear dierences beyond LCD's was rst studied in [21] on a PRAM machine model with a xed number of identical multi-functional processors. Scheduling methods for generating optimal schedules for linear dierences beyond LCD's on more realistic architectures, such as one with dierent types of functional units having dierent time constraints were discussed in [22] . This paper presents an expanded version of results published in [21, 22] .
In the eld of parallel computing, parallel evaluation of linear recurrences has been studied for quite some time. However, the underlying model diers from high-level synthesis, since parallel computing typically deals with a number of identical, multi-function processors, while high-level synthesis deals with a variety of functional units that dier in several attributes (e.g., functionality, cost, speed). We briey review some milestone results in linear recurrence evaluation from parallel computing. A survey of techniques can be found in [11] . Kogge and Stone [10] described a technique called recursive doubling for computing the rst-order linear recurrence system with unlimited resources. Our technique diers signicantly from Kogge and Stone's work in that
(1) we addressed resource constraints while recursive doubling did not, (2) our technique does not tie the derived schedules to any particular form but rather gives the capability of exploring schedule design space, while recursive doubling gives a particular form of parallel schedule and thus cannot be used for schedule design space exploration. Chen and Kuck developed an algorithm [1] for computing N m-th order band linear recurrence with p processors that achieves a time bound of (N=p)(2m approach, Harmonic Scheduling [24] , has been proposed for generating scalable parallel schedules for band linear recurrences. Using Harmonic Scheduling, strict time-optimal schedules for rst and second order band linear recurrences were found [24] . Harmonic Schedules diers from all previous work in its capabilities of exploring dierent types of schedules that satisfy the throughput requirement and of relating performance of schedules to their resource constraints. These capabilities of Harmonic Scheduling will facilitate the study of trade-os among design criteria in high-level synthesis of circuitries involning linear recurrences.
The technique presented in the paper is dierent from the systolic-array-based technique, in that our technique parallelizes the computation beyond loop-carried dependences while the algorithms on which the original systolic architecture [12] is based do not. Moreover, our technique do not tie to any particular kind of target architecture while systolic-array-based algorithms are systolic-array architecture specic. In fact, once the target architecture and resource constraints are given, our technique can be used to explore scalable schedules for band linear recurrences that t the resources [23] .
Assumptions, Terminology and Denitions
In order to highlight the technique for exploiting recurrences, we make some simple but realistic assumptions about the target architecture and the functional units used. The technique we describe is still applicable if these assumptions are relaxed to model more complex realistic architectures. Two types of architectures for our DSP computations are used in this paper, described below.
The rst type of architecture (used in Section 5) comprising a xed number of identical multi-functional processors. Each processor performs an arithmetic operation, e.g., addition and multiplication, in a single unit of time. We do not assume functional pipelining of in the processors We assume that each processor has a dedicated control unit (i.e., localized control with a register-le or memory for microinstructions) that controls the processor in each time step and that all processors are synchronized by a central clock in each time step.
We ignore the eect of the number of intermediate registers (storage units) required to store the results of redundant operations.
The second type of architecture (used in Section 6) comprising a xed number of dierent types of functional units (FU's): adders and multipliers. An adder performs in a time step (i.e., a time unit) a logic operation, a simple arithmetic operation such as an addition or a subtraction. A multiplier performs a multiplication in two time steps, twice as much time as an addition. We do not assume functional pipelining of the FU's in order to simplify our presentation. We assume that each FU has a dedicated control unit (i.e., localized control with a register-le or memory for microinstructions) that controls the FU in each time step and that all FU's are synchronized by a central clock in each time step. We also ignore the eect of the number of intermediate registers (storage units) required to store the results of redundant operations. Our objective is to generate an optimal schedule (i.e., a maximum possible sample rate) for this allocation that exploits LCD recurrences, followed by a minimization of the control overhead for each FU (i.e., minimize the size of the microprogram for each FU). We also want to know the precise combination of adders and multipliers that achieves full utilization when running a rate-optimal schedule for a given number of adders (or multipliers).
We now dene some performance measures for a system consisting a xed number of multi-functional The nal values of a linear recurrence are the values computed by the denition of the recurrence, i.e., the values assigned to the left-hand side of Equation (2) for 1 < i N and 1 < k N . This is referred to as the matrix decomposition of the evaluation. The sequential matrix chain multiplication above can be represented as a dependence tree as shown in Figure 1 . The computation above cannot be signicantly parallelized without breaking the dependences and introducing redundant operations, because each result node X i ; 2 i k, depends on the previously computed result node
By introducing more operations, a parallel evaluation of a system of eight 1st-order linear recurrence can recurrences, a redundant matrix is composed of three arithmetic operations, which can be done in two parallel chains of arithmetic operations, one consisting of a multiply followed by an add, and the other consisting of one add. As for the result nodes, the columns of numbers associated with the redundant nodes represent the time steps at which an FU is allocated to execute the operations in those nodes. The parallel schedule in Figure 2 computes the given recurrence in six steps on seven FU's. We construct a schedule with dierent types of schedule-components, which are based on the number of arithmetic operations in the component. For example, Figure 3 shows the two types of schedule-components for rst-order linear recurrences. The rst type of component consists of a nal vector-by-matrix multiplication using two operations, for example, X 2 and X 3 make two components of the rst type respectively. The second type of component is composed of a redundant matrix-by-matrix multiplication using three operations and a nal vector-by-matrix multiplication that uses the redundant matrix-by-matrix product, for example, nodes R 1 with X 5 and R 2 with X 6 make two components of the second type. In general, (m + 1) types of schedule components are needed for an optimal schedule for mth-order linear recurrences.
The example in Figure 2 shows a well-known method for parallelizing bend linear recurrences, called treeheight reduction (THR) [8] , and recursive doubling [10] that can be viewed as a kind of THR. As shown in Figure 2 , the idea of THR is to shorten the critical path of the computation tree for sequential evaluation of band linear recurrences by introducing redundant computations. The full THR schedule in Figure 2 computes the given recurrence in six steps on seven FU's. Although the tree height is lowered with parallelism resulting from the redundant matrix multiplications, note that the FU utilization is poor: U 7 = 27=42. This example implies the following. Given a xed number of processors (it is practically impossible to have the number of processors proportional to the problem size of the computation, i.e., the number of results we want to compute in linear recurrence evaluation), a parallel schedule with better processor utilization than the full THR schedule may well outperform the full THR schedule. This observation has motivated the paper.
Scheduling of Band Linear Recurrences beyond LCD's
We adopt a technique called Harmonic Scheduling [15, 24] , whose basic objective is to simultaneously minimize the number of redundant operations and maximize the utilization of the allocated FU's; we call these tasks the dual requirements of Harmonic Scheduling. We know that the number of nal operations in any schedule equals at least the number of operations in the sequential schedule, which cannot be reduced due to the denition of the required outputs. The only mechanism for signicantly speeding up the computation is to use multiple FU's to compute redundant values ahead of the nal value computation; this shortens the critical path in the computation of some nal values, and makes available multiple nal values in the fewest possible parallel steps by using previously computed (nal and redundant) values.
In order to compute as many nal values as possible, given a xed number of FU's, a schedule should do as few redundant operations as possible. However, one cannot reduce the number of redundant operations arbitrarily, say, to zero, since this will sequentialize the computation of nal results (thus making a parallel schedule degenerate into a sequential schedule in the limit), and will reduce the speedup to a minimum. Intuitively, the fastest parallel schedules for a xed number of FU's should and would satisfy the dual requirements. Such a state, in which the dual requirements are satised, can be seen as having achieved \harmony" among the conicting goals, hence the name \Harmonic Scheduling".
A parallel schedule can be divided into a number of periods that have the same organization. Such a period is constructed with a combination of all types of components satisfying the dual requirements. The optimality of the entire parallel schedule may be proved based on the optimality of each period. The numbers of all types of components to be used in constructing an optimal period are determined by the feasible solutions for the system of inequalities in [24] that relates the execution time to the pattern of the schedules. With this information, the computation tree for the period is constructed, and functional units are assigned to all arithmetic operations in the tree for each step. The complete schedule is nally obtained by constructing the computation tree at the operation level. A detailed example illustrating the procedure for constructing such schedules for a xed number of identical multi-functional processors is given in [24] . In the next section we illustrate our technique and prove the optimality of our schedules under more realistic constraints: using a xed number of adders and multipliers and assuming a multiplication takes twice as much time as an addition. 
Figure 4: The computation tree for p = 6.
We illustrate this procedure with an example. For p = 6 FU's, an optimal schedule needs ve components of the rst type and 10 of the second type. Its computation tree in terms of matrix multiplication is constructed as shown in Figure 4 . To facilitate FU assignment, that tree is further expanded into an equivalent tree in terms of arithmetic operations as shown in Figure 5 . Figure 5: The computation tree with allocated functional unit slots for p = 6.
In Figure 5 , the boxes represent the arithmetic operations in the computation tree for a period. The numbers inside the boxes represent the time steps at which the FU slots compute the operations. Each column of boxes in the top two rows corresponds to a nal node. Each column of boxes in the bottom three rows corresponds to a redundant node. The numbers associated with redundant operation boxes are the indices for a temporary array t that stores the redundant values.
In each time step, we have p = 6 FU slots. In step one, we allocate one FU slot to the rst operation on the critical path, and ve slots to the redundant trees, four of which are allocated to the four long chains and the last one to the rst short chain. Thus we see all the operation boxes marked with \1".
Step two is similar to step one: the rst slot is allocated the operation on the critical path, and the other ve are allocated to the redundant tree, four of which are allocated to the second operations on all the long chains and the last slot to the second long chain. In step ve, we allocate the rst slot to the operation on the critical path, and the next four to the operations in the third and fourth redundant trees. At this point, there are no available operations in the redundant trees for the last slot in step ve; thus we ll the slot with an available nal operation that is not on the critical path. The allocation proceeds until in step ten, when all the remaining nal operations are allocated|our 60 processor slots are precisely lled with 60 operations in the computation tree for a period.
In In lter design, the sensitivity analysis indicates that a less-sensitive lter structure may be obtained by breaking up the transfer function into lower-order sections and connecting these sections in parallel or in cascade.
Although high-order blocks may be attractive in some applications, the second-order section is a good building block to use in parallel or cascade structures [17] . Needless to say, the Harmonic Scheduling method applies to dierence equation (1) in general.
The dierence equation for the second order section is obtained from (1) by setting a 0 = 1 and all other coecients to zero except for b 1 and b 2 , yielding a recursive second-order all pole section with complex conjugate poles [20] . With little modication, the same schedule (for the simple second-order section) can be applied to general second-order sections.
This is a second-order linear recurrence with constant coecients, which if parallelized with LCD-preserving 
where y i ; x i stand for y(i); x(i) respectively. Note that for any redundant matrix multiplication in the chain, the results in the rst two rows of the product can be precomputed since they remain the same through all periods of a schedule.
Since we have a second-order linear recurrence, three types of schedule-components are involved in our parallel schedules; each type of schedule component is determined by the number of arithmetic operations as For any redundant matrix multiplication in the chain, the results in the rst two rows of the product can be precomputed since they remain the same through all periods of a schedule. Based on the number of arithmetic operations in a matrix multiplication above, three types of matrix multiplication will be involved in our parallel schedules, the rst type using 4 arithmetic operations, the second type using 6 and the third type using 8
respectively. Next, we go through the Harmonic Scheduling phases for dierence equation (3).
In the rst phase, we solve some inequalities for the numbers of all types of components to be used in our parallel schedule. Harmonic Scheduling establishes the following system of inequalities and asks for integer In the inequalities, w 0 ; w 1 ; w 2 are numbers of the rst, second, and third type components respectively, and p is the number of functional units. The rst equation relates the pattern of a period of a parallel schedule to its execution time. It essentially says that, in order to achieve an execution time T p for p functional units, a parallel schedule must produce w 0 +2w 1 +w 2 nal results (i.e., sample outputs) using every 4w 0 +10w 1 +8w 2 operations.
The other inequalities serve as necessary conditions for the feasibility of solutions. The rst inequality says that a solution would give an infeasible schedule (i.e., the schedule cannot achieve the execution time T p ), if the length of the critical path in a period (described by the left-hand side) were greater than the number of steps required by the number of operations. The other three simple inequalities constrain the solution elements to be non-negative. Specically, a parallel schedule must have some second type components.
We use execution time T p = (8p 0 4)N=(p(p + 1)) in inequalities (8) , which gives a speedup of (p + 1)=(2 0 1=p) > p=2 over sequential schedules { the best time so far. Although it is not the strict time lower bound, the strict time lower bound can be achieved by lowering T p in inequalities (8) . A set of integer solutions to inequalities (8) parameterized for p is given below:
w 0 = 2t 0 ; w 1 = 2t 0 ;
Not all integer solutions given by those expressions can be made into feasible schedules. But for any p > 2, there exists t 0 such that a feasible schedule can be built out of the solutions. A detailed discussion is given in [24] . The following algorithm outlines the second to the fourth phases of Harmonic Scheduling in constructing a feasible schedule using the solutions above.
Input: p functional units for constructing a schedule for parallel evaluation of the second-order dierence equation (3), with w 0 components of the rst type, w 1 of the second type and w 2 of the third type.
Output: a parallel schedule.
procedure construct schedule for lr(p, w0; w1; w2) 1. call procedure build computation tree(components of all types); 2. call procedure FU slot allocation(tree, slot sets); 3. transform the computation tree with allocated functional unit slots into a schedule; end (construct schedule for lr) Figure 6 : The computation tree in matrix multiplications for p = 4.
To illustrate the second through the fourth phases, we assume p = 4 and choose t 0 = 1, thus having w 0 = 2, w 1 = 2 and w 2 = 1. In the second phase, we construct a computation (dependence) tree for a period of the parallel schedule with the given components obtained from the rst phase, using the following procedure. The computation tree for p = 4 is shown in Figure 6 . In the third phase, we allocate functional unit slots to operations in the computation tree. The functional unit slots are allocated to the nal operations of the leading period and to the redundant operations of the succeeding period. The functional unit slots are allocated on a \most-urgent-rst" policy, meaning that a functional unit slot is always assigned to compute the operation which if not computed by that slot would ruin the full functional unit utilization. Note that there are multiple ways of allocating the functional unit slots.
The slot allocation procedure is similar to the one in [24] . Figure 7 shows the computation tree for the rst two periods expanded down into the arithmetic operation level with functional unit slot allocation. Each box represents an arithmetic operation. The number inside a box following the operator is the functional unit slot assigned to compute that operation at time indicated by the number, which will be the time step in the loop body of the schedule. The operations below outputs y 2 through y 6 are in the leading period and those below outputs y 7 through y 11 in the succeeding period (the length of the period = 5). The symbols t 1 through t 23 In the fourth phase, we write the computation tree with functional unit slot assignments into the nal parallel schedule. We compose the nal computation of the leading period and the redundant computation of the succeeding period into the loop body of our schedule, combined with the startup code, and express it into a parallel schedule as follows. In the schedule, Table 1 gives the improvement in speedup over the sequential schedule by our schedule, generated using Harmonic Scheduling, on the previously published fastest parallel schedule [4] , and the savings in the schedule size (i.e., the number of operations in the loop body of the schedule plus the startup) by our schedule from that schedule for the second order dierence equation. The speedup improvement is determined by ( As the order of the dierence equation increases, the speedup improvement and schedule size savings by our schedule increase for number of functional units > 2. Note that for some number of functional units such as p = 4, the schedule size can be made even smaller due to some properties [24] of the integer solutions to inequalities (8).
Schedules for IIR Filters with Diering Types of FU's
In this section, we use the Harmonic Scheduling based technique to successively approximate rate-optimal parallel schedules implementing IIR lters. We introduce more realistic resource constraints than in Section 5: our target architecture comprises a xed number of diering types of functional units. We demonstrate the procedure on second-order IIR lter sections. Our scheduling method applies to dierence equation (1) in general, as explained in Section 7. We show that a parallel schedule can be generated that computes a system of N second-order linear dierence equations describing basic second-order lter sections in 16=(4p + 3)N time (a time step equal to the time for performing an addition) plus a negligibly small number of startup steps, using an application-specic design consisting of p adders and 2p multipliers. This gives a sample rate of (4p + 3)=16 sample outputs per time step, which we prove to be the maximum possible sample rate.
We construct our parallel schedule using schedule-components of the three types described in the beginning of Section 5. Observing that in a schedule-component of each type, the number of multiplications equals the number of additions, and knowing that a multiplication takes twice as much time as an addition, we thus choose p a > 1 adders and p m = 2p a multipliers as an initial combination. Later, we show that a full utilization schedule can be achieved on a design with this combination.
The problem of nding rate-optimal schedules can be formulated as follows. Letw = (w 0 ; w 1 ; w 2 ) be a vector of the number of schedule components of each type,c = (c 0 ; c 1 ; c 2 ) = (6; 9; 12) be the cost in time for a schedule-component of each type respectively andcw be their inner product. Let = w 0 + w 1 + w 2 be the number of nal results (i.e., sample outputs) in a period, called the period length. On a design with p a > 1 adders and p m = 2p a multipliers, a schedule, which is constructed with a combination of schedule-components w = (w 0 ; w 1 ; w 2 ) for a period, can achieve a sample rate N=T pa ;pm if the following conditions are satised: . Condition 3 require the solution elements, i.e., numbers of zeroth through second type components, to be non-negative and describe the relationships among all types of components. In particular, a parallel schedule must have some zeroth and rst type components. Note that the number of second type components could be more than number of any other type, but it depends on the number of rst type components: a parallel schedule that has no rst type component has no second type components (this constraint can't be described by a linear inequality since it is a conditional statement). Condition 2, called the critical path condition, constrains that, if the critical path length in a period of the parallel schedule constructed withw (described by the left-hand side) is greater than time steps required by the number of operations, the schedule will not accomplish the execution time T pa;pm . The last condition states the requirement for full resource utilization during the loop pipeline stage.
Our goal is to nd the integer solutionsw 0 s from these conditions such that it yields the minimum T pa ;pm =N (i.e., the maximum sample rate N=T pa ;pm ) on a design with p a > 1 adders and p m = 2p a multipliers. We found thatw s = (4dp a =4e; 4dp a =4e; dp a =4e(4p a 0 5)) is a family of solutions for the conditions above that gives a sample rate of (4p a + 3)=16 per time step or an execution time of 16N=(4p a + 3) plus a small number of steps for pipeline startup for N linear recurrences.
It can be veried thatw s and T pa ;pm =N = 16=(4p a + 3) satisfy the rst three conditions. Thus to show that parallel schedules with that performance can be constructed, it suces to show that the fourth condition is also satised. Usingw s , we can construct 4dp a =4e redundant trees such that each of the rst dp a =4e trees has height (p a 0 1) and each of the remaining 3dp a =4e trees has height p a .
We divide the number of FU's into four congruence classes: p a 0(bmod4) p a 1 (mod 4); p a 2 (mod 4), and p a 3 (mod 4), because the operation execution plan (i.e., which operations are scheduled for execution in each step) for each congruence class is slightly dierent. We give the FU allocation plan for congruence class p a 0 (mod 4). The plans for three other congruence classes can be derived similarly.
Our operation execution plan for the loop body of our schedule involves three consecutive periods as follows.
We choose to do two sequential nal additions in the leading period. Those two additions are selected such that delaying their computation does not hinder the progress of the computation on the critical path. In the second period, we choose to do all the nal operations except for the two additions which we leave for the next iteration.
Also in the second period, we choose to do the last (p Table (2) enumerates the number and the type of operations scheduled for execution in each step in the loop body of our schedule on a design with p a 2 adders and 2p a multipliers, for p a in the rst congruence class (i.e., p a 0(mod4)). Note that p a additions are executed in each step, and 2p a multiplications are executed in every two steps|hence full utilization is achieved. We can verify that the number of additions in the plan is 4p a = (2w 0 + 3w 1 + 4w 2 ); so is the number of multiplications. Now it suces to show that across three consecutive periods, there are p a additions available for execution in each of 4p a steps and there are 2p a multiplications available for execution in every two of 4p a steps. This can be shown by nding the dependence pattern in the computation of a period. We omit this detail, which can be found in [22] , due to the space limit of this paper. 2 Theorem 6.2 On a design with p a adders and p m = 2p a multipliers, a schedule constructed withw s achieves the maximum possible sample rate of N=T pa ;2pa = (4p a +3)=16 under matrix chain multiplication representation, i.e., it achieves the strict upper bound on the sample rate for second-order dierence equations (3).
Proof:
First, we show that 3w 0 + w 1 is the critical path length of the nal computation for a schedule constructed with schedule-components of three types, assuming a sucient number of FU's of each type. Second, we show that there is no feasible schedule with a sample rate greater than (4p a + 3)=16.
The rst statement can be shown true by constructing a computation tree with w 0 ; w 1 ; w 2 components of three types respectively, and then by calculating the critical path length of the nal computation in terms of time steps. The details of this part are given in [22] .
We prove the second statement by contradiction. Assume that there exists a schedule which achieves higher sample rate than (4p a + 3)=16, i.e., the time equation becomes (p a 0 dp a =4e 0 l 0 2) (dp a =4e + l + 1) j + 4l 0 2 2p a mults in two steps adds in each step adds in each step j + 4l 0 1 2 mults in two steps 2(p a 0 dp a =4e 0 l 0 2) 2(dp a =4e + l + 1) j + 4l p a adds in each step mults in two steps mults in two steps In the rst step of Algorithm 6.1, we determine the number of each type of schedule components to be used to construct the computation tree for a period of our schedule by solving the inequalities in the problem formulation. The problem formulation for nding optimal schedules for mth-order IIR lters can be derived similarly to that for second-order IIR lter (3).
In the second step of Algorithm 6.1, we construct a computation (dependence) tree for a period of the parallel schedule with those components obtained from the rst step, using procedure \build computation tree". The computation tree for p = 4 is shown in Figure 8 . In the third step, we allocate FU slots to operations in the computation tree for the three consecutive periods. The FU slots are allocated as planned in Theorem (6.1). The FU slots are allocated on a \most-urgent-rst" policy, meaning that a FU slot is always assigned to compute the operation which if not computed by that slot would ruin the full FU utilization. Note that there are multiple ways of allocating the FU slots.
We shall not detail the slot allocation procedure, which is similar to the one in [24] , due to the space limit of this paper. In the fourth step, we transform the computation tree of arithmetic operations with functional units assignment obtained in the third step into a parallel schedule in program form.
Next, we illustrate the steps of our scheduling method for dierence equation (3) using an example for a design with p a = 4 adders and p m = 8 multipliers. In the rst step, we solve the inequalities for the numbers of components of each type to be used in our parallel schedule, which we know to bew = (4; 4; 11). Figure 9 shows the computation tree of the rst two periods (the length of the period = 19) expanded to the arithmetic operation with FU slot allocation. Assuming there is a startup period, the top part corresponds to the entire computation tree for second period involved in the rst iteration of the schedule, and the bottom part to the redundant trees in the third period. A box with one or two numbers inside corresponds to an addition or a multiplication respectively. A number inside a box is the functional unit slot assigned to compute that operation at the time indicated by the number, which will be the time step in the loop body of the schedule. The symbols t 1 through t 109 correspond to the storage (e.g., register les or memory words) holding intermediate results. In the fourth phase, we write the computation tree with FU slot assignments into the nal parallel schedule, shown in the Appendix A. We compose the computation assigned to FU's across the three periods into the loop body of our schedule. The pipeline startup code computes those redundant operations in the rst period that are not allocated for execution in the rst iteration. In the schedule, [20] are read in and stored. During each iteration, = 19 sample inputs are read in and stored. This algorithm needs to store 19 sample inputs and 19 sample outputs in each iteration { resulting in a small hardware cost.
Note that a multiplication takes two time steps and an addition takes one step.
The loop body shown in Appendix A realized the full resource utilization: in odd steps in the loop body of the schedule, eight multiplications and four additions are issued in parallel; in even steps, only four additions are issued since all eight multipliers are continuing with the multiplications issued in the preceding odd steps.
As calculated beforehand, in every iteration of 16 steps, = w 0 + w 1 + w 2 = 19 sample outputs are produced, yielding a sample rate of (4p a + 3)=16 = 19=16 outputs per step, or equivalently, an average processing time of 16=(4p a + 3) = 16=19 step(s) per sample output. The loop body of the schedule consists of 64 multiplications and 64 additions. Table 3 gives, for the rst nine FU congurations, the improvements in sample rate by our rate-optimal schedule over the column-sweep algorithm, Percolation Scheduling technique, and the Regular Schedule for computing the linear dierence equation (3) . The sample rate is the average number of sample outputs produced per time step by the schedule's loop body, since the lters run essentially forever (i.e., the problem size of the dierence equations can be seen as innity), thus the eect of startup code is negligible. The speedup of our rate-optimal schedule over those scheduling techniques is determined by dividing the optimal sample rate by the sample rate of the scheduling technique in comparison. All the scheduling techniques are compared on the same FU assumption that a multiplication takes two cycles (steps) and an addition takes one cycle. The column-sweep algorithm [8] achieves a sample rate of 1=3 on an FU conguration of two multipliers and two adders, but it gives no further improvement as the FU conguration scales up beyond two multipliers and two adders. The column-sweep algorithm is also the lower bound on execution time for the scheduling techniques proposed in [16, 2] in parallel evaluation of linear dierence equation of form (1) . Note that although all these algorithms are claimed to be optimal, they respect loop-carried dependences, i.e., they are not scalable for IIR lter computation. Similarly, the Percolation Scheduling technique [14] achieves a sample rate of 1=3 using two multipliers and two adders, but it gives no further speedup beyond this FU conguration since it parallelizes the code subject to data dependences. Finally, we compare our rate-optimal schedule with the Regular Schedule [23] , which is a scalable single-instruction-multiple-data (SIMD) schedule for parallel evaluation of band linear recurrences.
Practical Signicance and Implications
The novel features of the Harmonic-Scheduling-based technique include that (1) it established the relationship expressed in mathematical formula between parallel schedule for BLR's, resource constraints and performance of the schedules, (2) it provides capability for exploring design space of parallel schedules with schedule components, resource constraints and target performance as major parameters. In particular, our approach can be used to successively approximate time-optimal schedules implementing IIR lters for a given target architecture.
These features will facilitate examining in mathematical terms trade-os among dierent schedule designs for computations containing BLR's, while the previously published approaches do not provide these features. Table 3 : The performance improvements of our rate-optimal schedules over other well-known scheduling algorithms for 2nd-order order recursive lter section on rst 9 design congurations.
As an illustration, the parallel schedule derived in Section 6 is the rst that achieves the optimal sample rate (i.e., throughput) for second-order IIR lters. on a design with p a > 1 adders and p m = 2p a multipliers, assuming a 2:1 ratio of multiplication to addition time. Our scheduling method, which generates at compiletime rate-optimal schedules, extends to linear dierences of mth-order (m 1). All we need is to generalize the four conditions in the problem formulation to mth-order linear dierences similar to [24] . For the time equation, we generalize the cost function to (m + 1) types of schedule-components, which is used in constructing optimal schedules for mth-order linear dierences. The critical path condition can be generalized by calculating the critical path length of the nal computation in an arbitrary period composed of schedule-components of all types. The third condition can be formulated similarly as described in Section 6. The last condition can be generalized by expressing the number of additions and multiplications in terms of quantity of schedulecomponents and cost of each type in a period of the schedule for mth-order linear dierences.
Our technique can be used in several design situations. As an illustration, considering a 2:1 ratio of multiplication to addition time, the optimal schedule for second-order linear dierences shows that a design with p a > 1 adders and p m = 2p a multipliers is the minimum and thus the most economical for the sample rate determined by the performance formula, since it achieves full FU utilization. This nding is indicative of the general design trade-os to making design choices. When we are given a throughput (sample rate) requirement for a design, we can easily determine the minimum hardware (thus area) cost of a satisfactory design using the performance formula T pa;pm = 16N=(4p a + 3). When we are given a hardware (or area) constraints with an unbalanced number of p a adders and p m multipliers instead of p m = 2p a , we can eectively determine the maximum throughput for the design by letting p a = bp m =2c when p m < 2p a and using p a directly when p m 2p a in the performance formula. This is implied by a property of linear recurrences: equal numbers of multiplications and additions are alternating in the computation and this does not change as redundant operations are introduced, hence the maximum throughput of the computation is determined by the type of FU's whose quantity is below the ratio of the balanced conguration.
In general, the time for multiplications may not be twice that of additions. When the ratio of multiplication to addition time is an integer greater than 2:1, our method can be used to generate optimal schedules, provided that the problem formulation is properly tuned, i.e., adjusting the cost coecients for schedule components of dierent types and the critical path condition. When the ratio of multiplication to addition time is a real number greater than 2:1, the optimal schedule is given at the ceiling of the given ratio. In general, our method can also accommodate other realistic design constraints such as wiring delays and layout constraints, as a framework for generating optimal schedules for band linear recurrences. Our future work will investigate these additional issues.
The numerical stability and preservation of desirable lter characteristics with the Harmonic Scheduling based approach are also open for further investigation.
for k = 0 to no more sample input do each step in parallel with pa adders and 2pa multipliers 
