This paper presents a scheduling technique used to optimize computation speed of loops running on architectures that may include pipelined dedicated processors. The problem under consideration is to find an optimal periodic schedule satisfying the timing constraints.
Introduction
This paper deals with automatic parallelization of computation loops performing an identical set of operations repeatedly. One repetition of the loop is called an iteration. A parallel implementation of the loop implies that each operation of the loop is mapped on a hardware unit at a given time, therefore the scheduling theory is used to find start times of these operations.
Problem Statement and Motivation
Cyclic scheduling deals with a set of operations (generic tasks) that have to be performed an infinite number of times [1]. This approach is also applicable if the number of loop repetitions is large enough. A schedule is called nonoverlapped if all operations belonging to one iteration have to finish before any operation of the next iteration can start. If the execution of operations belonging to different iterations can interleave, the schedule is called overlapped [2]. An overlapped schedule can be more effective, especially if the hardware units are pipelined. A periodic schedule is a schedule of one iteration that is repeated with a fixed time interval called the period (also called the initiation interval). The aim is then to find a periodic schedule with a minimum period [1, 3, 4].
The studied problem is motivated by an application of RLS (Recursive Least Squares) filter for active noise cancellation [5] . A design of such filter gets complicated, due to the representation of real numbers, calculated by arithmetic units in FPGAs. We consider two types of arithmetic libraries operating with real numbers. The first type is the logarithmic number system arithmetics, namely the High-Speed Logarithmic Arithmetic (HSLA) library [6] implementing multiplication, division and square-root operations simply as fixed-point addition, subtraction and right shift. Addition and subtraction operations require more complicated evaluation, hence only one pipelined addition/subtraction unit is usually available for a given application. On the other hand, the number of multiplication, division and square root units can be almost unlimited. For that reason, the scheduling of algorithms on HSLA can be formalized as cyclic scheduling of tasks with precedence delays on one dedicated processor. This problem will be called 1-DEDICATED in the rest of this article.
The second type is the floating-point library, namely FP32 by Celoxica [7]. It uses the widely known IEEE format to store the data. In this case, each unit requires a significant number of hardware elements on the gate array, hence only one unit of each kind is usually available for a given application. Scheduling of algorithms on FP32 can be formalized as cyclic scheduling of tasks with precedence delays on a set of dedicated processors. This problem will be called m'-DEDICATED in the rest of this article.
In both architectures, rather complex arithmetic units are required. Therefore, scheduling of such dedicated HW resources has to carefully consider the algorithm structure, in order to achieve the desired performance of applications. Scheduling also helps to choose the appropriate arithmetic library prior to the algorithm implementation.
Besides the period minimization we consider three secondary objectives: the overlap minimization (the simplest objective, which does not increase computation requirements), the makespan minimization (suitable for a small number of iterations) and the minimization of data transfers (an objective reflecting the hardware implementation). Further, we propose three extensions of our problem: start time related deadlines, multiprocessor tasks and changeover times.
In many practical applications, the task must be completed within a given time limit. Such constraint is usually represented as a classical deadline, related to the start of the schedule.
In this article we use a more general constraint, called the start time related deadline, i.e. a deadline related to the start time of another task [8]. Start time related deadlines can be conveniently used to model not only the classical deadline, but also task synchronization, a watchdog timer and other practical time constraints.
Not only the arithmetic units, but also the memory units, become limited resources that have to be taken into consideration in many practical applications of FPGAs. In such cases, some arithmetic units and some memory units, may be required at one moment. This problem is formalized as multiprocessor task scheduling.
Some recent FPGA devices have the capability of partial runtime reconfiguration [9]. The most important problem of dynamic reconfiguration, not seen in classical (i.e. static) design, is the temporal interdependence of the units to be reconfigured. The scheduling algorithm has to determine whether and when the reconfiguration is performed. If a task is processed by a unit that is not currently available in the FPGA, the processing is extended by the reconfiguration time. This problem is formalized using changeover times.
The ILP based scheduling method, shown in this article, solves the problems mentioned above.
The optimal solution found by this approach was used to schedule problems like RLS filter [10] and it was further extended to handle imperfectly nested loops like FICMA algorithm [11] .
Moreover, such an optimal solution can be used to obtain performance metric for heuristic algorithms aiming to solve larger instances.
Related Work
If the number of processors is not limited, cyclic scheduling can be used to build a periodic schedule in polynomial time [1, 12]. Unfortunately, for a fixed number of processors the problem becomes NP-hard [3]. If all tasks have unit processing times, several special cases with polynomial time complexity exist [3]. Terms related to cyclic scheduling are modulo scheduling and software pipelining [13, 14], which are commonly used in the compiler community.
A periodic schedule is a particular case of the K-periodic schedule, where two consecutive iterations may not be the same, but the schedule of an iteration is repeated every K iterations (the integer parameter K is called the periodicity). The objective of K-periodic scheduling is then to maximize the computation rate, which is the ratio of K to the period length [15, 16]. In the rest of this article we assume only the periodic schedule, since the gain of the K-periodic schedule is very small and its implementation is more complex, leading, for example, to a larger FPGA design.
Existing methods for the scheduling of loops can be divided into heuristic approaches and methods using integer linear programming (ILP). The heuristics-based techniques do not guarantee optimal solutions but have much lower computing requirements making them applicable in code compilers. On the other hand, ILP is not a polynomial algorithm but for problems with reasonable size it finds an optimal solution in a reasonable amount of time. Moreover, the ILP formulation is convenient since it allows one to add additional constraints to the scheduling problem (e.g. number of available registers and processor communications) or to formulate rather complex objective criterion (e.g. combination of the overlap and makespan minimizations).
An overview of the heuristic approaches is presented e.g. in [17, 1] . Some heuristics use a retiming technique [18] which is based on the transformation of a scheduled loop to another one with a shorter iteration period by rearranging delays. Other approaches [19] [20] [21] use generalization of non-cyclic approximation list scheduling algorithms to cyclic problems. A comparative study of relevant heuristic modulo scheduling techniques is shown in [22] .
A common approach based on ILP uses a binary decision variable $x_{it}$ determining whether the computation of node $i$ starts at time step $t$. Unfortunately, the size of such an ILP model (and hence its time complexity) depends on the period length. The ILP approach with the $x_{it}$ decision variable is used in [23], where the additional objective is to optimize the word-length of arithmetic units. In [24], the ILP model is extended by automatic processor selection from a library, and [2] applies the approach to architectures with interprocessor communication. A general solution for K-periodic schedules by ILP, where the number of variables does not depend on the period length, is shown in [16] and extended in [25]. With respect to this work, our solution leads to a simpler ILP formulation with fewer integer variables.
Solution Outline
We propose a method based on integer linear programming to find an optimal periodic schedule.
With respect to the target applications, the method is designed for instances with long period.
Motivated by the HSLA architecture [6] we formulate the 1-DEDICATED problem, where one dedicated processor represents the bottleneck of such an architecture (i.e. the addition/subtraction unit). with results presented in [16] . A design of the RLS filter on HSLA is shown in Appendix A.
Overview on Basic Cyclic Scheduling
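Operations in a computation loop can be considered as a set of $n$ generic tasks $T = \{T_1, T_2, \ldots, T_n\}$, where $T_i^k$ denotes the $k$-th occurrence of task $T_i$ and $s_i(k)$ its start time. In a periodic schedule the start times satisfy
$$s_i(k) = s_i + (k - 1) \cdot w, \qquad (1)$$
where $w$ is the period and $s_i$ denotes the start time of task $T_i$ in the first iteration, i.e. occurrence $T_i^1$. Therefore, the scheduling problem is to find the start time $s_i$ for each task $T_i$, since $s_i(k)$ can be simply deduced from (1).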
The example in Figure 1 (b) shows the data dependence graph of the computation loop shown in Figure 1 (a).
The aim of the Basic Cyclic Scheduling (BCS) problem [1] is to find a periodic schedule while minimizing the period $w$. This scheduling problem is easily solved when the number of processors is not limited, i.e. it is sufficiently large. The period $w$ is then given by the critical circuit $c$ in graph $G$, i.e. a circuit $c \in C(G)$ maximizing the ratio
$$w = \max_{c \in C(G)} \frac{\sum_{e_{ij} \in c} l_{ij}}{\sum_{e_{ij} \in c} h_{ij}}, \qquad (3)$$
where $C(G)$ denotes the set of cycles in $G$. No periodic schedule with a shorter period can be feasible.
Since the tasks are repeated every $w$ clock cycles, the periodic schedule is entirely given by the scalar $w$ and the vector of the start times in the first iteration $s = (s_1, s_2, \ldots, s_n)$. An optimal periodic schedule can be provided in polynomial time, since the time complexity to find critical circuits is $O(n^3 \cdot \log(n))$ [27].
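The role of the critical circuit can be illustrated by a minimal sketch. It assumes that every edge carries a length $l_{ij}$ and a height $h_{ij}$, that every circuit has a height of at least 1, and that a period $w$ is feasible for an unlimited number of processors exactly when no circuit has a positive weight under the edge weights $l_{ij} - w \cdot h_{ij}$. The function names and the edge tuple layout are illustrative only, and the sketch uses a Bellman-Ford style test combined with bisection, not the algorithm of [27].

```python
def has_positive_cycle(n, edges, w):
    """Return True if the graph weighted by l - w*h contains a positive circuit.

    edges is a list of tuples (i, j, l, h) with nodes numbered 0..n-1.
    """
    dist = [0] * n  # longest-path potentials, all nodes start at 0
    for _ in range(n):
        changed = False
        for i, j, l, h in edges:
            if dist[i] + l - w * h > dist[j]:
                dist[j] = dist[i] + l - w * h
                changed = True
        if not changed:
            return False  # potentials converged, no positive circuit
    return True  # still improving after n passes


def min_integer_period(n, edges):
    """Smallest integer w without a positive circuit (unlimited processors)."""
    lo, hi = 1, max(1, sum(l for _, _, l, _ in edges))
    if has_positive_cycle(n, edges, hi):
        raise ValueError("some circuit has zero height, no period is feasible")
    while lo < hi:
        mid = (lo + hi) // 2
        if has_positive_cycle(n, edges, mid):
            lo = mid + 1
        else:
            hi = mid
    return lo


# Two circuits: 0->1->2->0 (length 6, height 1) and 1->2->1 (length 7, height 2).
example_edges = [(0, 1, 2, 0), (1, 2, 3, 0), (2, 0, 1, 1), (2, 1, 4, 2)]
print(min_integer_period(3, example_edges))
```

In this example the first circuit is critical (ratio 6 against 3.5), so the smallest integer period equals its ratio.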
Cyclic Scheduling of tasks with Precedence Delays on Limited Number of Dedicated Processors
The Basic Cyclic Scheduling problem, solved in polynomial time, assumes that the number of processors is not limited. When the number of processors is restricted, the problem becomes NP-hard.
Representation of Pipelining by Precedence Delays
The scheduling problem related to the HSLA architecture differs further, since the processors are pipelined and not identical. Some tasks run on one pipelined dedicated processor and the remaining tasks run on an unlimited number of processors. Further, when using the FP32 architecture [7], we deal with a set of dedicated processors. Both architectures require a different model than graph $G$ in the previous section, where the length of an edge is equal to the processing time, i.e. $l_{ij} = p_i$. Therefore, we introduce a model based on so-called precedence delays, defined as follows: the length of edge $e_{ij}$ is greater than or equal to the processing time $p_i$ assigned to node $i$, i.e. $l_{ij} \geq p_i$.
The processor is thus occupied by task $T_i$ during the processing time $p_i$, but task $T_j$ may start no earlier than $l_{ij}$ clock cycles after the start time of $T_i$. The length $l_{ij}$ therefore specifies the precedence delay from task $T_i$ to task $T_j$.
Precedence delays are useful when we consider pipelined processors. The processing time p i represents the time to feed the processor (i.e. new data can be fed to the pipelined processor after p i clock cycles) and length l ij represents the time of computation (i.e. the input-output latency). Therefore, the result of a computation is available after l ij clock cycles.
Representation of Reduced Tasks by Precedence Delays
In the case of an unlimited number of processors, the problem with precedence delays is still solvable using a polynomial algorithm for BCS [1] . But the assumption of an unlimited number of processors is not satisfied for HSLA architecture where some tasks are assigned to one dedicated processor (the addition/subtraction unit). Therefore we formulate it as 1-DEDICATED problem, i.e. cyclic scheduling of tasks with precedence delays on one dedicated processor (at the end of this section we show that this problem is NP-hard).
The suggested formulation of the scheduling problem uses a reduction of graph $G$ to a reduced graph $G'$. All nodes (tasks) except the ones assigned to the dedicated processor are reduced.
These tasks, running on an unlimited number of processors, will be further called reduced tasks and the corresponding processors will be called reduced processors. Therefore $T'$, the set of tasks assigned to the dedicated processor, corresponds to the set of nodes of $G'$.
Since we consider period $w$ to be constant in each iteration of the scheduling algorithm (explained in Section 4), we can weight edges $e_{ij}$ in graph $G$ by the amplitude
$$a_{ij}(w) = l_{ij} - w \cdot h_{ij}.$$
The new precedence constraints in graph $G'$, represented by edges $e'_{ij}$, are then given by the longest paths in the original graph $G$ weighted by $a_{ij}(w)$ (found e.g. by Floyd's algorithm). There is an edge $e'_{ij}$ (from $T_i$ to $T_j$ in $G'$) of amplitude $a'_{ij}$ if and only if there is a path from $T_i$ to $T_j$ in $G$ that goes only through reduced nodes. The value of the amplitude $a'_{ij}$ is the length of the longest such path in $G$ weighted by $a_{ij}(w)$.
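The reduction can be sketched as a Floyd-style longest-path computation in which only reduced nodes may serve as intermediate nodes. This is an illustration under assumptions: the amplitude of an edge is taken as $l_{ij} - w \cdot h_{ij}$, the period $w$ is assumed to be large enough that no circuit through reduced nodes has a positive weight, and the function name and data layout are hypothetical.

```python
NEG_INF = float("-inf")


def reduce_graph(n, edges, dedicated, w):
    """Longest amplitudes between dedicated nodes via reduced nodes only.

    edges: list of (i, j, l, h); dedicated: set of node indices kept in G'.
    Returns {(i, j): amplitude} for the edges of the reduced graph.
    """
    amp = [[NEG_INF] * n for _ in range(n)]
    for i, j, l, h in edges:
        amp[i][j] = max(amp[i][j], l - w * h)
    for k in range(n):
        if k in dedicated:
            continue  # paths may pass through reduced nodes only
        for i in range(n):
            if amp[i][k] == NEG_INF:
                continue
            for j in range(n):
                if amp[k][j] != NEG_INF and amp[i][k] + amp[k][j] > amp[i][j]:
                    amp[i][j] = amp[i][k] + amp[k][j]
    return {
        (i, j): amp[i][j]
        for i in dedicated
        for j in dedicated
        if amp[i][j] != NEG_INF
    }


# Node 1 is a reduced task; nodes 0 and 2 run on the dedicated processor.
g_edges = [(0, 1, 2, 0), (1, 2, 1, 0), (2, 0, 3, 1), (0, 2, 1, 0)]
reduced = reduce_graph(3, g_edges, {0, 2}, w=7)
print(reduced)
```

Here the path 0 -> 1 -> 2 of amplitude 3 dominates the direct edge 0 -> 2 of amplitude 1, so the reduced graph keeps a single edge from node 0 to node 2 with amplitude 3.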
Problem Complexity
The problem 1-DEDICATED is NP-hard, since Bratley's scheduling problem $1|r_j, d_j|C_{max}$ (monoprocessor scheduling with release dates and deadlines and without precedence constraints) [26] can be polynomially reduced (P-reduced) to it, i.e. each instance of Bratley's problem can be P-reduced to an instance of our scheduling problem. The P-reduction is shown in Figure 3. The independent task set of Bratley's problem is represented by nodes $T_1, \ldots, T_n$, and their release dates and deadlines are represented using precedence delays related to a dummy task $T_0$. The release date $r_j$ of task $T_j$ is the length $l_{0j}$ of the edge $e_{0j}$ from $T_0$ to $T_j$, with $h_{0j} = 0$. Assuming $s_0 = 0$, Inequality (2) yields the restriction $s_j \geq r_j$, which is exactly the restriction given by the release date.
Edges from $T_i$ to $T_0$ represent deadlines. Let $e_{i0}$ have the height $h_{i0} = 1$ and the length $l_{i0} = w - d_i + p_i$, where $w$ is equal to the value of the maximum deadline (the moment when the next iteration can potentially start). In the same way, the deadline restriction, i.e.
$$s_i + p_i \leq d_i,$$
is obtained from (2) for each of these edges assuming $s_0 = 0$.
We recall that Bratley's problem $1|r_j, d_j|C_{max}$ was proven to be NP-hard by P-reduction from the 3-PARTITION problem [28]. Consequently, the 1-DEDICATED problem is NP-hard, and so is the m'-DEDICATED problem, since 1-DEDICATED is its subproblem.
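The P-reduction can be made concrete with a small sketch. It assumes that Inequality (2) has the form $s_j - s_i \geq l_{ij} - w \cdot h_{ij}$, which is consistent with the release date and deadline restrictions derived in this section; the function names and the tuple layout are hypothetical.

```python
def bratley_to_cyclic(release, deadline, proc):
    """Build the cyclic instance for tasks 1..n with dummy task T_0 (node 0).

    Returns (w, edges) with edges given as (i, j, l, h).
    """
    w = max(deadline)  # period equals the maximum deadline
    edges = []
    for j, (r, d, p) in enumerate(zip(release, deadline, proc), start=1):
        edges.append((0, j, r, 0))          # release date: s_j >= r_j
        edges.append((j, 0, w - d + p, 1))  # deadline: s_j + p_j <= d_j
    return w, edges


def satisfies_precedence(start, w, edges):
    """Check the assumed form of Inequality (2) on every edge."""
    return all(start[j] - start[i] >= l - w * h for i, j, l, h in edges)


w, edges = bratley_to_cyclic(release=[0, 2], deadline=[3, 6], proc=[2, 3])
print(satisfies_precedence([0, 0, 2], w, edges))  # T_1 meets its deadline
print(satisfies_precedence([0, 2, 2], w, edges))  # T_1 would finish at 4 > 3
```

With $s_0 = 0$ the first edge of each task enforces its release date and the second edge enforces its deadline, so a start time vector passes the check only if every task lies inside its time window. The processor constraint is not modeled in this sketch.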
Solution of Cyclic Scheduling on Dedicated Processors with Precedence Delays by ILP
Due to the NP-hardness it is meaningful to formulate our scheduling problem as a problem of Integer Linear Programming (ILP), since various ILP algorithms solve instances of reasonable size in reasonable time. The period $w$ is assumed to be a constant in this section, since multiplication of two decision variables cannot be formulated as a linear inequality. First we demonstrate this method on the 1-DEDICATED problem and then we generalize it to the m'-DEDICATED problem.
Let $\hat{s}_i$ be the remainder after division of $s_i$ (the start time of $T_i$ in the first iteration) by $w$ and let $\hat{q}_i$ be the integer part of this division. Then $s_i$ can be expressed as follows:
$$s_i = \hat{s}_i + \hat{q}_i \cdot w, \quad \hat{s}_i \in \langle 0, w - 1 \rangle, \quad \hat{q}_i \geq 0. \qquad (4)$$
This notation divides $s_i$ into $\hat{q}_i$, the index of the execution period, and $\hat{s}_i$, the number of clock cycles within the execution period. The schedule has to obey the two constraints explained in Subsections 4.1 and 4.2.
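The decomposition of a start time into the index of the execution period and the offset within it is an ordinary integer division with remainder. A minimal sketch, with a hypothetical function name:

```python
def split_start_time(s, w):
    """Return (s_hat, q_hat) such that s == s_hat + q_hat * w and 0 <= s_hat < w."""
    q_hat, s_hat = divmod(s, w)
    return s_hat, q_hat


s_hat, q_hat = split_start_time(17, 5)
print(s_hat, q_hat)  # offset within the period, index of the execution period
```

For a start time of 17 and a period of 5, the task starts 2 clock cycles into execution period number 3.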
Precedence Constraint
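The first constraint is the precedence constraint corresponding to Inequality (2). Substituting (4), it can be formulated using $\hat{s}$ and $\hat{q}$ as
$$(\hat{s}_j + \hat{q}_j \cdot w) - (\hat{s}_i + \hat{q}_i \cdot w) \geq l_{ij} - w \cdot h_{ij}. \qquad (5)$$
Hence, we have $n'_e$ inequalities ($n'_e$ is the number of edges in the reduced graph $G'$), since each edge represents one precedence constraint.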
Processor Constraint for One Dedicated Processor
Processor constraints are the second type of restrictions. They express the dedicated processor restriction, i.e. at most one task is executed on one dedicated processor at a given time. An execution period that is neither in the head nor in the tail of the schedule contains all tasks, even if they are from different iterations. Based on this observation, the processor constraints can be formulated using $\hat{s}_i$ only (notice that the processor constraints do not depend on $\hat{q}_i$). Two disjoint cases can occur:
(i) In the first case, we consider task $T_j$ to be followed by task $T_i$ (both are from arbitrary iterations) within the execution period.
The corresponding constraint is therefore
$$\hat{s}_i - \hat{s}_j \geq p_j. \qquad (6)$$
At the same time, $T_i$ has to release the processor before $T_j$ starts again in the next execution period, i.e.
$$\hat{s}_i - \hat{s}_j \leq w - p_i. \qquad (7)$$
The conjunction of (6) and (7) into one double-inequality is
$$p_j \leq \hat{s}_i - \hat{s}_j \leq w - p_i. \qquad (8)$$
(ii) In the second case, we consider task $T_i$ to be followed by task $T_j$. To derive the constraints for the second case, it is enough to exchange index $i$ with index $j$ in Double-Inequality (8):
$$p_i \leq \hat{s}_j - \hat{s}_i \leq w - p_j. \qquad (9)$$
Using simple algebraic operations we derive a form similar to (8):
$$p_j - w \leq \hat{s}_i - \hat{s}_j \leq -p_i. \qquad (10)$$
The exclusive OR relation between the first case and the second case, i.e. either (8) holds or (10) holds, prevents a direct formulation of the problem as an ILP model. Inequalities (8) and (10) can, however, be merged into one double-inequality using the binary decision variable $\hat{x}_{ij}$ ($\hat{x}_{ij} = 1$ when $T_i$ is followed by $T_j$ and $\hat{x}_{ij} = 0$ when $T_j$ is followed by $T_i$):
$$p_j \leq \hat{s}_i - \hat{s}_j + w \cdot \hat{x}_{ij} \leq w - p_i. \qquad (11)$$
To derive a feasible schedule on one dedicated processor, Double-Inequality (11) must hold for each unordered couple of two distinct tasks. Therefore, there are $(n'^2 - n')/2$ double-inequalities (where $n'$ is the number of tasks in the reduced graph $G'$), i.e. there are $n'^2 - n'$ inequalities specifying the processor constraints.
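The processor constraint can be checked by brute force on small periods. The sketch below assumes that Double-Inequality (11) has the form $p_j \leq \hat{s}_i - \hat{s}_j + w \cdot \hat{x}_{ij} \leq w - p_i$ with a binary $\hat{x}_{ij}$, and compares it with a direct test of whether the two tasks occupy a common clock cycle when the schedule is wrapped modulo $w$. All function names are illustrative.

```python
def constraint_11_holds(s_i, s_j, p_i, p_j, w):
    """True if some binary x satisfies p_j <= s_i - s_j + w*x <= w - p_i."""
    return any(p_j <= s_i - s_j + w * x <= w - p_i for x in (0, 1))


def overlap_modulo_w(s_i, s_j, p_i, p_j, w):
    """True if the two tasks share a clock cycle within the execution period."""
    slots_i = {(s_i + t) % w for t in range(p_i)}
    slots_j = {(s_j + t) % w for t in range(p_j)}
    return bool(slots_i & slots_j)


def constraint_matches_overlap(max_w=6):
    """Exhaustively compare both tests for all offsets and processing times."""
    for w in range(1, max_w + 1):
        for p_i in range(1, w + 1):
            for p_j in range(1, w + 1):
                for s_i in range(w):
                    for s_j in range(w):
                        feasible = constraint_11_holds(s_i, s_j, p_i, p_j, w)
                        clash = overlap_modulo_w(s_i, s_j, p_i, p_j, w)
                        if feasible == clash:
                            return False
    return True


print(constraint_matches_overlap())
```

The exhaustive comparison confirms, for the tested range, that the double-inequality is satisfiable exactly when the two tasks do not collide on the dedicated processor.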
Objective Function
Using the ILP formulation we are able to test the schedule feasibility for a given value of $w$. In addition, we minimize one of the following objective criteria.
Overlap Minimization
One of the simplest objectives is to minimize the iteration overlap by the objective function $\min \sum_{i=1}^{n'} \hat{q}_i$. The summarized ILP model, using variables $\hat{s}_i$, $\hat{q}_i$, $\hat{x}_{ij}$, is shown in Figure 4. It contains $2n' + (n'^2 - n')/2$ variables and $n'_e + n'^2 - n'$ constraints.
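The size of the model follows directly from the counts stated above: two variables per task, one binary variable per unordered couple of tasks, one inequality per edge and two inequalities per couple. A small helper, with a hypothetical name, makes the growth visible:

```python
def ilp_model_size(n_tasks, n_edges):
    """Return (variables, constraints) of the model before any elimination."""
    couples = (n_tasks * n_tasks - n_tasks) // 2
    variables = 2 * n_tasks + couples   # s_hat, q_hat per task, x_hat per couple
    constraints = n_edges + 2 * couples  # precedence + processor inequalities
    return variables, constraints


for n_tasks, n_edges in [(8, 10), (16, 20), (32, 40)]:
    print(n_tasks, n_edges, ilp_model_size(n_tasks, n_edges))
```

The number of binary variables grows quadratically with the number of tasks on the dedicated processor, but it does not depend on the period length.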
Minimization of Iteration Makespan
The advantage of ILP formulation is the possibility to formulate various objective functions.
For example, when N, the number of iterations, is small, it may be convenient to minimize the makespan of iteration, by adding one variable c max and n constraints of typê
Such a reformulated problem not only decides the feasibility of the schedule for the given period w, but if such a schedule exists, it also finds the one with the shortest tail.
Minimization of Data Transfers
In order to minimize the data transfers, we add one slack variable $\Delta_{ij}$ to each precedence constraint (5), resulting in
where $(w \cdot (\hat{q}_{max} + 1))$ represents an upper bound on $\Delta_{ij}$ and the objective is to minimize
The meaning of $\hat{q}_{max}$ is explained further in Section 4.8. Such a reformulated problem not only decides the feasibility of the schedule for the given period $w$, but if such a schedule exists, it also finds the one with minimal data transfers among the tasks.
Problem Extension by Start Time Related Deadlines
In many practical applications, task $T_i$ must be completed within given time limits. Such constraints are expressed here as start time related deadlines. A simple example is shown in Figure 5. In this case we want the maximal time from the start time of task $T_2$ to the start time of task $T_4$ to be 6 clock cycles, and from the start time of task $T_3$ to the start time of task $T_5$ to be 4 clock cycles, which is modeled by the start time related deadlines depicted by dashed edges in Figure 5. An example of an application is a two-channel filter (see technical report [30]). The time delay between reading the input data by the first channel and reading the input data by the second channel is restricted, as is the production of data by both filters. The ILP model presented in Figure 4 can be simply extended as follows: $d_i$, the deadline of task $T_i$ related to the beginning of the iteration, can be included into the ILP model as
and d ij , task T i deadline related to the start time of task T j , as
The shift of iteration index between T i and T j is given by h ij . Both l ij and h ij are non-negative numbers. Please notice the similarity between the Inequality (16) denoting a start time related deadline and the Inequality (5) denoting a precedence relation. Hence we have n e inequalities ( n e is the number of deadlines) added to ILP model in Figure 4 .
A special case of constraint is when task T j must be scheduled exactly l ij clock cycles after task T i . This constraint corresponds, for example, to the access to a memory where task T i represents the memory reading operation and T j operation that processes the data. To avoid the use of temporary register for specific data transfer, time delay between these tasks is given by the minimal latency l ij of T i to T j . Such constraint is given by (16) where l ij = l ij and the sign of inequality is substituted by equality.
Generalization to a Set of Dedicated Processors
The above-mentioned formulation of processor constraints for the 1-DEDICATED problem extends directly to a set of dedicated processors. Tasks assigned to different processors are not conflicting, i.e. they do not impose any restriction with respect to the feasibility of the schedule. Therefore Double-Inequality (11) is only considered for each couple of tasks assigned to the same dedicated processor.
Solution of Problems with Multiprocessor Tasks
The multiprocessor task scheduling problem is an extension of the m'-DEDICATED problem, where task $T_i$ is associated with a set of $m_i$ processors $P_i$. This problem can be solved with the ILP model in Figure 4 by a slight modification of the processor constraints (11).
The processor constraint between $T_i$ and $T_j$ in the m'-DEDICATED problem is considered only when both tasks are dedicated to the same processor; otherwise it is omitted. Analogously, in the problem with multiprocessor tasks, the processor constraint between $T_i$ and $T_j$ is considered when $P_i \cap P_j \neq \emptyset$, i.e. tasks $T_i$ and $T_j$ cannot overlap if they require the same processor.
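Selecting the couples of multiprocessor tasks that need a processor constraint amounts to testing whether their processor sets share at least one processor. A minimal sketch, in which the task and processor names are made up for illustration:

```python
from itertools import combinations


def conflicting_couples(processors_of):
    """Return couples of tasks whose processor sets share at least one processor."""
    return [
        (a, b)
        for a, b in combinations(sorted(processors_of), 2)
        if processors_of[a] & processors_of[b]
    ]


# Hypothetical instance: T2 needs an adder and a memory port at the same time.
processors_of = {
    "T1": {"add"},
    "T2": {"add", "mem"},
    "T3": {"mul"},
    "T4": {"mem"},
}
print(conflicting_couples(processors_of))
```

Only the couples returned by this test receive a double-inequality of type (11); all other couples may overlap freely.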
Solution of Problems with Changeover Times
This scheduling problem extends m'-DEDICATED problem by partitioning the set of tasks T into groups G 1 , . . . , G r . If a task T i from group G k is scheduled immediately after a task T j from different group G l , there is a changeover time r lk i.e. task T i starts at the earliest r lk time units after the finishing time of T j , otherwise r lk = 0.
The monoprocessor scheduling problem with changeover times can be also directly solved by a modification of the processor constraints (11) . Since the processor constraint restricts the overlap only between a couple T i and T j then processing times p i and p j can be considered different for different couples of tasks, i.e. different processor constraints. Let us consider task T i belonging to group G k and task T j belonging to group G l . Then processing time p i used in processor constraint (11) is substituted by p i + r kl and processing time p j is substituted by p j + r lk . For tasks from the same group the processor constraint (11) stays the same.
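The substitution of processing times can be sketched as a small lookup. The dictionary layout for the changeover times and the function name are assumptions made for this illustration; the entry `changeover[(k, l)]` plays the role of $r_{kl}$.

```python
def processing_times_for_couple(p_i, p_j, group_i, group_j, changeover):
    """Processing times to use in the processor constraint of one couple.

    changeover[(k, l)] is the changeover time from group k to group l.
    """
    if group_i == group_j:
        return p_i, p_j  # same group: the constraint stays unchanged
    return (
        p_i + changeover[(group_i, group_j)],
        p_j + changeover[(group_j, group_i)],
    )


changeover = {("G1", "G2"): 4, ("G2", "G1"): 6}
print(processing_times_for_couple(2, 3, "G1", "G2", changeover))
print(processing_times_for_couple(2, 3, "G1", "G1", changeover))
```

Because the adjusted values enter only the constraint of the given couple, the same task can appear with different effective processing times in different processor constraints.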
Elimination of Redundant Processor Constraints
The time required to solve the generated ILP model roughly corresponds to the number of integer variables, therefore it is meaningful to reduce this number as much as possible. Not all processor constraints are necessary, considering precedence constraints (5). Dealing with cyclic scheduling, one can find tasks from distinct iterations in one period. Therefore we specify a priori a parameter $\hat{q}_{max} = \max\{\hat{q}_i\}$, given as an additional constraint to the ILP model. The necessity of Double-Inequality (11) for a couple of tasks $T_i$ and $T_j$ assigned to the same processor can be decided e.g. using linear programming (LP) composed of constraints (17), (18) and (19), as explained below. If any feasible solution of this LP problem exists, then $T_i$ and $T_j$ can be in conflict (i.e. can overlap) and therefore the corresponding double-inequality cannot be eliminated.
Tasks T i and T j overlap if and only if task T i starts before task T j completes AND task T j starts before task T i completes. Therefore when Inequality (18) and Inequality (19) hold, tasks T i and T j may overlap by at least one clock cycle assuming integer parameters of tasks. Inequality (17) represents all precedence constraints given by graph G in terms of start times.
where
The LP composed of constraints (17), (18) and (19) is called iteratively for all possible $\delta$, the difference of the iteration index between tasks $T_i$ and $T_j$ within one period. If the polynomial time LP finds a feasible solution for any integer $\delta \in \langle -\hat{q}_{max}, \hat{q}_{max} \rangle$, tasks $T_i$ and $T_j$ can eventually cause a conflict on the dedicated processor and the corresponding Double-Inequality (11) is necessary.
The mentioned LP model operates on integer valued solutions since the system of Inequalities (17), (18), (19) forms a totally unimodular matrix and all input parameters are integers [31] .
This elimination is performed in polynomial time and it leads to a decrease in the number of constraints and to a decrease in the number of binary decision variablesx ij .
Minimization of the Period
We recall that the goal of cyclic scheduling is to find a feasible schedule with the minimal period w. Therefore, w is not constant as we assumed in the previous section, but due to the periodicity of the schedule it is a positive integer value. Period w * , the shortest period resulting in a feasible schedule, is constrained by its lower bound w lower , for which the feasibility needs to be tested, and its upper bound w upper , which is feasible if at least one feasible solution exists.
The optimal period $w^*$ can be found iteratively by formulating one ILP model, described in the previous section, for each iteration. These iterative calls of ILP do not need to be performed for all $w$ between the lower and upper bounds; the interval bisection method can be used instead, since $w^*$ is not preceded by any feasible solution (i.e. no $w \leq w^* - 1$ results in a feasible solution). Therefore, there are at most $\log_2(w_{upper} - w_{lower})$ iterative calls of ILP.
Lower Bound
Lower bound $w_{lower}$ is constrained by the processor load
$$w_{processor\ load} = \max_{d \in \langle 1, m' \rangle} \sum_{T_i \text{ assigned to processor } d} p_i,$$
i.e. $w_{processor\ load}$ is given as the maximum sum of the processing times of tasks assigned to one processor. When the problem instance is free of deadlines, we can use Equation (3) to calculate $w_{lower\ unlimited}$, related to the critical circuit of $G'$ (identical to the critical circuit of $G$). When deadlines are permitted, the problem instance does not need to be feasible even for an unlimited number of processors. The feasibility is constrained by the set of Inequalities (20), specifying precedence relations, and by the set of Inequalities (21), specifying start time related deadlines. Therefore, $w_{lower\ unlimited}$, the minimal feasible period for an unlimited number of processors, is calculated using the LP formulation given by (20), (21) while minimizing $w$. The solution of this LP formulation directly gives the optimal period of the BCS problem extended by deadlines, since the period $w$ is a variable (due to the absence of the processor constraints).
Finally, w lower is equal to max(w processor load , w lower unlimited ), since the processor load in one iteration cannot exceed the period and the schedule of 1-DEDICATED problem or m'-DEDICATED problem cannot be executed faster than the schedule of BCS problem.
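The combination of the two lower bounds can be written down in a few lines. The sketch treats the bound for an unlimited number of processors as a given number, since its computation (critical circuit or LP) is outside the scope of this illustration; the function names and the example data are hypothetical.

```python
from collections import defaultdict


def processor_load_bound(proc_time, assigned_to):
    """Maximum sum of processing times over the dedicated processors."""
    load = defaultdict(int)
    for task, processor in assigned_to.items():
        load[processor] += proc_time[task]
    return max(load.values())


def period_lower_bound(proc_time, assigned_to, w_lower_unlimited):
    """Larger of the processor load bound and the unlimited-processor bound."""
    return max(processor_load_bound(proc_time, assigned_to), w_lower_unlimited)


proc_time = {"T1": 2, "T2": 3, "T3": 1}
assigned_to = {"T1": "add", "T2": "add", "T3": "mul"}
print(period_lower_bound(proc_time, assigned_to, w_lower_unlimited=4))
print(period_lower_bound(proc_time, assigned_to, w_lower_unlimited=9))
```

In the first call the adder load of 5 clock cycles dominates, in the second call the bound given by the precedence structure dominates.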
Upper Bound
The space of feasible periods w given by (20) , (21) may have its upper bound due to the deadlines (e.g. w = ∞ is not feasible for a graph with a circuit of start time related deadlines).
Therefore w upper unlimited , maximal feasible period for unlimited number of processors, is calculated using the LP formulation given by (20) , (21) while maximizing w.
The schedule found when calculating w lower unlimited allows two tasks of G to be processed at the same time, which may result in a conflict when both tasks are assigned to the same processor. But we can derive a new schedule by a simple serialization of the conflicting tasks, while obtaining w serial tasks .
Finally, w upper = min(w serial tasks , w upper unlimited ).
Optimal Schedule Search
Period w * , the shortest period of 1-DEDICATED or m'-DEDICATED problem is calculated with the interval bisection method. First, we run ILP model for w = w lower , if it is feasible then w * = w lower . If not, then we proceed with w = w upper . If it is not feasible then this problem has no solution due to deadlines, otherwise we continue with w = (w lower + w upper )/2 and so on.
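The search procedure can be sketched with the ILP feasibility test abstracted into a callable. The sketch assumes that feasibility is monotone between the bounds (every period above a feasible one is feasible as well), which is what makes bisection valid; `is_feasible` is a stand-in for one ILP call and is not part of the original formulation.

```python
def find_shortest_period(w_lower, w_upper, is_feasible):
    """Return the shortest feasible period in <w_lower, w_upper>, or None."""
    if is_feasible(w_lower):
        return w_lower
    if not is_feasible(w_upper):
        return None  # no solution, e.g. due to deadlines
    lo, hi = w_lower, w_upper  # invariant: lo is infeasible, hi is feasible
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


calls = []


def toy_oracle(w):
    """Toy stand-in for the ILP: periods of at least 13 cycles are feasible."""
    calls.append(w)
    return w >= 13


print(find_shortest_period(5, 40, toy_oracle), len(calls))
```

Each call of the oracle corresponds to solving one ILP model with a constant period, so the number of expensive calls stays logarithmic in the width of the interval.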
The above-mentioned method gives the feasible schedule of $G'$, if one exists. The corresponding schedule of $G$ is also feasible, since the tasks executed on an unlimited number of reduced processors are subject only to the precedence relation constraints that are already included in the precedence delays of $G'$. Figure 6 shows the schedule of the example depicted in Figure 1(a), where the dedicated processor is shown on the bottom line. The expansion of tasks on reduced processors needs to satisfy precedence relations only; they can be simply executed in an optimal manner (e.g. as soon as possible) since there is an unlimited number of reduced processors available.
Results
The presented scheduling technique was implemented and run on an Intel Pentium 4 at 2.4 GHz using the non-commercial ILP solver tool GLPK [32]. From the resulting schedule, we automatically generated code in the Handel-C language [33]. The complexity of the integer linear programs can be estimated using the number of integer variables in the ILP model, but each change of the problem instance may lead to a significant change of the algorithm computation time. Before the elimination of redundant processor constraints, the ILP model for the 1-DEDICATED problem contains $2n' + (n'^2 - n')/2$ variables and $n'_e$ precedence constraints, one constraint per start time related deadline and $n'^2 - n'$ processor constraints.
The efficiency of the elimination depends on the input graph $G$ and on the given $\hat{q}_{max}$.
The ILP model for m'-DEDICATED problem before elimination is even smaller. It contains
Benchmarks
In this section we show the results on one application implemented in FPGA (see RLS filter in Appendix A) and several benchmarks (for more details see technical report [30] ).
One benchmark is the second order wave digital filter (WDF) [34] consisting of eight tasks, where each circuit $c$ has $\sum_{e_{ij} \in c} h_{ij} = 1$. The WDF 2chan is an extension of WDF consisting of two synchronized channels (two identical WDF filters) with 6 start time related deadlines.
Another benchmark based on WDF is WDF reconf, which considers one reconfigurable unit. Experiments are summarized in Table 1, where $n'$ denotes the number of tasks after reduction. Experiment van Dongen, containing only addition operations, was executed on one pipelined addition unit. Its lower bound of the period, $w_{lower}$, is given by the sum of the task processing times.
Experiments Elliptic 1 and Elliptic 2 are both executed on one pipelined addition/subtraction unit and an unlimited number of multiplication units. In both of them, there are a lot of redundant inequalities eliminated due to strong precedence relations. In case of the Elliptic 2 , the optimization required longer CPU time since the unit utilization is higher. Experiment Elliptic 3 is an adaptation of Elliptic 1 with only one multiplication unit.
Experiment RLS 1, executed on one pipelined addition/subtraction unit and an unlimited number of multiplication units, corresponds to the application of the RLS filter for active noise cancellation using the HSLA library (outlined in Appendix A). Experiment RLS 2 was executed on the architecture corresponding to the FP32 library [7], i.e. one pipelined addition/subtraction unit, one pipelined multiplication unit and one pipelined division unit. Experiment RLS 3 was executed on an architecture with one pipelined multiplication unit, one pipelined division unit and an unlimited number of addition/subtraction units.
Comparison With Binary Method
To compare our method with the ILP method that uses one binary decision variable for each task and each clock cycle within the period (hereafter called the binary method; see e.g. [2]), we measured average CPU times for randomly generated instances of the 1-DEDICATED problem. In this benchmark, the elimination of redundant processor constraints is not used.
Each randomly generated graph $G$ consists of $2/3 \cdot n$ edges of height $h_{ij} = 0$ and $1/2 \cdot n$ edges of height $h_{ij} > 0$. The height $h_{ij} > 0$ was chosen from a uniform distribution on the interval $\langle 1, 2 \rangle$ (and rounded to the nearest integer). The outdegree of nodes in the generated graph $G$ was restricted to 3. All tasks (corresponding to nodes) were associated with one dedicated processor with processing time $p_i = 2$. To demonstrate the influence of the precedence delay on CPU time, we performed two benchmarks, one for $l_{ij} = 4$ (see Figure 7) and one for $l_{ij} = 6$ (see Figure 8). For instances with greater precedence delays the two methods differ even more.
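The instance generator described above can be sketched as follows. This is only an illustrative reconstruction, not the generator used for Figures 7 and 8. The function name `random_instance`, the integer rounding of the edge counts, the exclusion of duplicate edges, and the choice to orient zero-height edges from a lower to a higher node index (so that no circuit has zero total height) are all assumptions.

```python
import random


def random_instance(n, delay, seed=None, max_outdegree=3):
    """Return (p, edges) for a random 1-DEDICATED instance.

    p[i] is the processing time of task i (always 2, as in the benchmark).
    edges is a list of tuples (i, j, l_ij, h_ij).
    """
    rng = random.Random(seed)
    outdeg = [0] * n
    used = set()
    edges = []

    def draw(count, candidates, pick_height):
        # Take `count` edges from shuffled candidates, respecting the outdegree limit.
        rng.shuffle(candidates)
        added = 0
        for i, j in candidates:
            if added == count:
                break
            if (i, j) in used or outdeg[i] >= max_outdegree:
                continue
            used.add((i, j))
            outdeg[i] += 1
            edges.append((i, j, delay, pick_height()))
            added += 1
        if added < count:
            raise ValueError("not enough admissible edges for this n")

    # Assumption: zero-height edges point forward, so they cannot close a circuit.
    forward = [(i, j) for i in range(n) for j in range(i + 1, n)]
    any_pair = [(i, j) for i in range(n) for j in range(n) if i != j]

    draw((2 * n) // 3, forward, lambda: 0)
    # Uniform on the interval <1, 2>, rounded to the nearest integer.
    draw(n // 2, any_pair, lambda: round(rng.uniform(1.0, 2.0)))

    p = [2] * n
    return p, edges
```

Calling `random_instance(12, 4, seed=1)` and `random_instance(12, 6, seed=1)` gives two instances with the same edge structure that differ only in the precedence delay, which mirrors the two benchmark settings.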
Conclusions
This paper presents an ILP-based cyclic scheduling method used to optimize the computation speed of loops running on architectures that include dedicated processors (each dedicated processor executes a disjoint set of tasks) and an unlimited number of processors (these processors execute a set of operations disjoint from the ones executed on dedicated processors).
Our approach uses graph reduction, in which operations executed on an unlimited number of processors are transformed into precedence delays. To exploit the graph structure, we use polynomial-time algorithms to simplify the optimization process. Namely, either a graph algorithm calculating the critical circuit or LP is used to find the lower bound of the optimal period.
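A minimal sketch of the graph-based lower bound follows. It assumes the precedence constraint in the form commonly used in cyclic scheduling, $s_j - s_i \geq l_{ij} - w \cdot h_{ij}$, so that a period $w$ satisfies all precedence constraints exactly when no circuit has positive total weight $l_{ij} - w \cdot h_{ij}$. The function names, the Bellman-Ford style check and the bisection over integer $w$ are illustrative choices, not necessarily the algorithm used in the implementation. The sketch also assumes non-negative integer delays and that every circuit has total height of at least 1.

```python
def precedence_feasible(n, edges, w):
    """True if s_j - s_i >= l_ij - w * h_ij can be satisfied for all edges.

    Longest-path relaxation from an implicit source connected to every node:
    the constraints are infeasible exactly when a positive-weight circuit exists.
    """
    dist = [0] * n
    for _ in range(n):
        changed = False
        for i, j, l, h in edges:
            cand = dist[i] + l - w * h
            if cand > dist[j]:
                dist[j] = cand
                changed = True
        if not changed:
            return True  # start times converged, no positive circuit
    return False  # still improving after n passes: positive circuit


def critical_circuit_bound(n, edges):
    """Smallest integer period w that satisfies all precedence constraints."""
    lo = 1
    hi = max(1, sum(l for _, _, l, _ in edges))
    if not precedence_feasible(n, edges, hi):
        raise ValueError("graph contains a circuit with zero total height")
    while lo < hi:
        mid = (lo + hi) // 2
        if precedence_feasible(n, edges, mid):
            hi = mid  # mid works, try smaller periods
        else:
            lo = mid + 1  # mid violates some circuit
    return lo


# Edges are (i, j, l_ij, h_ij). Circuit 0-1-2-0 has length 12 and height 1,
# circuit 0-1-0 has length 10 and height 2, so the first one is critical.
example_edges = [(0, 1, 4, 0), (1, 2, 4, 0), (2, 0, 4, 1), (1, 0, 6, 2)]
bound = critical_circuit_bound(3, example_edges)
```

Bisection is safe here because enlarging $w$ only relaxes the precedence constraints. The value obtained this way ignores processor constraints, so it is a lower bound on the optimal period rather than the optimum itself.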
Further, we use LP to eliminate redundant inequalities and binary decision variables. An optimal periodic solution is then searched for iteratively using the interval bisection method.
As shown in the experiments, $w^*$ is very close to $w_{lower}$, since the critical circuit represents the main constraint for cyclic scheduling. This effect is quite pronounced (see columns $w^*$ and $w_{lower}$ in Table 1), so it remains an open question whether simply incrementing the period, starting from $w_{lower}$, would be better than the interval bisection method.
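The trade-off between the two search strategies can be illustrated with a small sketch. The oracle `feasible(w)` is hypothetical and stands in for one call of the ILP model with the period fixed to $w$. The upper bound `w_upper` is likewise an assumed input that is known to be feasible, and the bisection variant additionally assumes that feasibility is monotone in $w$. Both functions count oracle calls, since each call corresponds to one ILP run.

```python
def search_bisection(feasible, w_lower, w_upper):
    """Interval bisection; assumes w_upper is feasible and feasibility is monotone."""
    calls = 0
    lo, hi = w_lower, w_upper
    while lo < hi:
        mid = (lo + hi) // 2
        calls += 1
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo, calls


def search_increment(feasible, w_lower, w_upper):
    """Try w_lower, w_lower + 1, ... and stop at the first feasible period."""
    calls = 0
    for w in range(w_lower, w_upper + 1):
        calls += 1
        if feasible(w):
            return w, calls
    raise ValueError("no feasible period in the given interval")


def oracle(w):
    # Hypothetical instance whose optimal period lies one above the lower bound.
    return w >= 27


w_bis, calls_bis = search_bisection(oracle, 26, 200)
w_inc, calls_inc = search_increment(oracle, 26, 200)
```

In this made-up instance both strategies return the same period, but incrementing needs 2 oracle calls while bisection needs 8. The picture reverses when the optimum lies far above $w_{lower}$, and the individual ILP runs need not take equal time, so the sketch counts calls only and says nothing about total CPU time.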
The advantage of the ILP model presented in Figure 4, in comparison with common ILP models used for similar problems, is that the number of variables is independent of the period length. The solution to the one dedicated processor problem (called a single functional unit) shown in [16] also does not depend on the period length, but our solution is more efficient. Namely, the solution in [16] additionally requires binary variables induced by linearization of the multiplication by the period $w$. In our solution, we search for the value of $w$ iteratively within a finite number of steps, and therefore $w$ appears as a constant in the ILP model and does not introduce any additional decision variable. Consequently, before the elimination of the redundant processor constraints, our solution has fewer decision variables.
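The independence from the period length can be made concrete by counting variables. The sketch below uses the variable count stated for the 1-DEDICATED model, $2n + (n^2 - n)/2$, and, for a time-indexed model with one binary variable per task and clock cycle, the count $n \cdot w$. The function names are illustrative, and the second count covers only the binary decision variables named in that description, which is an assumption about the rest of such a model.

```python
def vars_period_independent(n):
    """Variables of the 1-DEDICATED model before elimination: 2n + (n^2 - n)/2."""
    return 2 * n + (n * n - n) // 2


def vars_time_indexed(n, w):
    """One binary variable for each task and each clock cycle within the period."""
    return n * w


# Doubling the period leaves the first count unchanged and doubles the second.
n = 10
counts = {
    w: (vars_period_independent(n), vars_time_indexed(n, w))
    for w in (26, 52, 104)
}
```

The first count grows quadratically with $n$ but not with $w$, so which model is smaller depends on the instance: the time-indexed count exceeds it once $w > 2 + (n - 1)/2$. Variable counts are only a rough proxy for solution time.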
Our method of algorithm modeling, transformation and scheduling for FPGAs is fully automated. It is therefore easily incorporated into design tools (e.g. Handel-C code generation from the resulting schedule) and considerably simplifies rapid prototyping.
Moreover, we have shown how multiprocessor tasks, changeover times and minimization of data transfers can be considered in the method.
The results of our scheduling method applied to the RLS filter design are better than those achieved by an experienced FPGA programmer (see results in [10]). For a given sampling period of 1/44100 second, the filter was able to process 129 iterations with our method, in contrast to the manual design achieving 75 iterations. This acceleration of about 70% is mainly due to the schedule overlap (operations belonging to different iterations interleave) and pipelining, which are rather difficult to achieve in a manual design.
Acknowledgement
This work was supported by the Ministry of Education of the Czech Republic under research programme MSM6840770038 and by the ARTIST2 Network of Excellence on Embedded Systems Design IST-004527. The authors would like to thank the anonymous referees for providing many invaluable comments and suggestions that led to significant improvement of this paper.
Appendix

A Application
The RLS filter algorithm is a set of equations (see the inner loop in Figure 9(a)) solved in an inner and an outer loop. The outer loop is repeated for each input data sample, taken every 1/44100 second.
The inner loop iteratively processes the sample up to the N-th iteration. The quality of filtering increases with increasing N. All iterations of the inner loop need to be finished before the end of the sampling period when the output data sample is generated and the new input data sample starts to be processed.
The specific inner loop of the RLS filter is shown in Figure 9(a), where the task labels are next to the arithmetic operations. Figure 9(b) shows the corresponding $G$, the graph after reduction.
The schedule presented in Figure 10 was found by the first call of the ILP model for $w^* = 26$ (the same period as $w_{lower}$ given by the critical circuit). The real-time demo application, implemented on a Celoxica rc200e development board (chip xc2v1000-4, design clock 50 MHz) using the 19-bit logarithmic number system arithmetic HSLA, reached 129 iterations of the inner loop at a sampling frequency of 44100 Hz. More details on the application can be found in [10].
