Abstract-Run-time reconfigurable computing extends the classic role of FPGAs towards processing elements that support multitasking similar to a microprocessor. The main advantage of this usage lies in the granularity of the reprogrammable device, which can be adapted to each task that has to be executed. Hence, a scheduling problem arises in which a set of tasks with specific area demands and execution times has to be executed. Providing a set of design alternatives for each task allows for a higher utilization of the FPGA. Nevertheless, it is computationally infeasible to consider all theoretically achievable design alternatives of a task. Thus, this paper presents a scheduling algorithm that reduces the number of design alternatives that are utilized for a schedule. Furthermore, a scheduling algorithm is presented which achieves optimization in feasible execution time.
I. INTRODUCTION
The traditional application of reconfigurable devices like FPGAs is mainly characterized by prototyping systems and hardware accelerators. Here, one specific task is implemented once on such a dedicated device, which can be described as compile-time reconfiguration (CRC). In contrast, run-time reconfigurable computing (RRC) is enabled by SRAM-based FPGAs [1], for example the Virtex II and the Virtex II Pro from XILINX, whose configuration memory is divided into frames that can be reconfigured independently. Thus, partial reconfiguration is supported, which allows for executing one task while other tasks are removed from or reloaded onto the FPGA. Hence, a scheduling problem arises in which a set of tasks with specific area demands, execution times, and deadlines has to be scheduled on the FPGA, similar to the behavior of a real-time operating system. The reconfiguration model of the reconfigurable device can be abstracted by a 1D or 2D area model as shown in Figure 1. In the 1D model the device is divided into several slots which can be reconfigured separately. This model simplifies the scheduling mechanism and trades this simplification for a sub-optimal utilization of the given hardware area. The more complex 2D area model allows for placing the tasks at any free position on the reconfigurable device [2]. Partial reconfiguration is currently provided by FPGAs from XILINX. The Virtex II family provides reconfiguration of slots, thus a 1D area model is supported. Reconfiguration is accomplished either by providing the configuration bitstream over the JTAG interface or internally over the so-called ICAP interface.
Nevertheless, both models lead to NP-hard scheduling problems, so heuristic scheduling policies such as earliest deadline first (EDF) have to be considered. This paper presents a design space exploration technique which computes a set of different design alternatives and features a time-efficient scheduling algorithm.
The rest of the paper is structured as follows: In Section II an overview of existing run-time reconfiguration techniques is given. Section III addresses the derivation of area-time tradeoffs. In Section IV the scheduling problem is formulated and mapped to a level strip packing problem, and in Section V a scheduling algorithm is presented. Furthermore, Section VI reports results regarding execution time and optimization efficiency. Finally, a conclusion ends the paper.
II. RELATED WORK
In the work of Panainte et al. [3] two scheduling algorithms are proposed with the goal of minimizing the FPGA area. Those scheduling algorithms also take into account the time that is needed for reconfiguration. Two scenarios are discussed: one in which all operations have to be executed on the FPGA, and one in which some operations can also be executed in software.
Steiger et al. [4] present an online real-time scheduling problem and two heuristic approaches, namely the horizon and stuffing techniques. Both techniques are integrated in an operating system for reconfigurable devices.
Danne and Platzner [5] have recently shown the benefits for FPGA utilization of using three and five implementation variants per task, which offers more flexibility to schedule the tasks. This work gives a theoretical gain in the utilization of an FPGA and is based on artificially generated tasks and design alternatives, which might not be achievable for real task sets. The authors have extended their work towards a reconfigurable operating system for a 1D area model [6].
III. DESIGN SPACE EXPLORATION
Basically, the process of generating design alternatives regarding run time and area of a given task is based on code transformations like loop unrolling or tree height reduction:
Loop unrolling describes the reduction of the execution time of a loop by implementing several instances of the loop body which can be executed in parallel [7]. We assume reducible loops with a statically known iteration space. The unrolling factor $U$ counts the number of duplicated loop bodies. We allow a loop unrolling factor $U$ for iteration counts $M$ such that $M/U$ is an integer. The original cycle count of the loop is defined by the cycle count of the longest path of the loop body multiplied with the number of loop iterations. The area of the unrolled loop is computed with $a_U = a \cdot U$. To simplify the model, we neglect an additional increase of the gate count caused by the added complexity of the controller implementation [8].
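The loop unrolling model above can be illustrated with a minimal Python sketch. It enumerates the admissible unrolling factors $U$ (those that divide the iteration count $M$) and derives an area and a cycle count for each. The names `Variant` and `unroll_variants` and the example numbers are hypothetical. The cycle count of the unrolled loop, $(M/U)$ times the cycle count of the loop body, is an assumption that treats the duplicated loop bodies as fully parallel; controller overhead is neglected as in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    unroll: int  # unrolling factor U
    area: int    # a_U = a * U (controller overhead neglected)
    cycles: int  # assumed (M / U) * cycles of the loop body


def unroll_variants(body_area, body_cycles, iterations):
    """Enumerate one design alternative per admissible unrolling factor.

    A factor U is admissible when it divides the iteration count M.
    """
    variants = []
    for u in range(1, iterations + 1):
        if iterations % u == 0:
            variants.append(
                Variant(
                    unroll=u,
                    area=body_area * u,
                    cycles=(iterations // u) * body_cycles,
                )
            )
    return variants


# Hypothetical loop: body of 100 area units, 4 cycles, M = 8 iterations.
variants = unroll_variants(body_area=100, body_cycles=4, iterations=8)
for v in variants:
    print(v)
```

For $M = 8$ the sketch yields the factors 1, 2, 4, and 8, so each loop contributes four area-time alternatives under these assumptions.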
Tree height reduction deals with different schedulings of the data operations inside a basic block. Consider, for instance, a completely sequential schedule as opposed to an implementation featuring full parallelism.
Those transformations can be applied to the control flow graph (CFG) of the algorithm, and the combination of those independent transformations results in different realizations, which are called design points.
Based on the described code variants each basic block (BB) is annotated with a set of $k$ possible implementation types $I(BB) = \{(a_1, t_1), (a_2, t_2), \ldots, (a_k, t_k)\}$. Thus, the number of possible design points for a CFG that consists of $N$ basic blocks is computed with $DP = \prod_{i=1}^{N} |I(BB_i)|$. Hence, for large $N$ an exhaustive search over all possible design points is infeasible. The problem of identifying optimal implementations can be generally described as a multi-objective optimization problem [9] and has the following formulation: $\min\{f(x)\}$ subject to $x \in S$, with constraints $b_i$ like maximum area, maximum response time, or maximum power consumption given by the requirements of the system. Those constraints, which can be grouped into a vector $b = (b_1, \ldots, b_n)^T$, define a set of inequalities
All possible realizations span the design space $X = \{x_1, \ldots, x_M\}$. The following relations between two design points $x_1$ and $x_2$ can be identified:
We use the relation for vectors in the following way. The relations greater and greater equal are defined symmetrically. As stated before, it is not possible to find a single solution that would optimize all the objectives simultaneously. Therefore, we use the most common description of optimality for a multi-objective optimization given by Vilfredo Pareto [10]. [...] and tabu search. Usually, the optimization result of these techniques is only one design point. In order to identify a Pareto front with these heuristics a multi-start approach has to be chosen, which leads to a high running time of the algorithms. The most promising candidates for the identification of a Pareto front are algorithms that maintain a set of solutions and iteratively converge towards the Pareto-optimal set of solutions. Algorithms that are best suited to this behavior are genetic algorithms [11] or particle swarm optimization [12]. For example, Figures 3(a), 3(b), and 3(c) depict Pareto fronts that have been derived by a problem-specific genetic algorithm [13]. [...] Nevertheless, the number of design alternatives that are generated by this approach is far too high to be considered for scheduling. Therefore, a decision maker has to extract a small subset of design alternatives which supports the scheduling process optimally.
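To make the notions of design space and Pareto front concrete, the following Python sketch enumerates all design points of a tiny, hypothetical CFG with two basic blocks and filters the Pareto-optimal (area, time) pairs by exhaustive comparison. This is not the genetic algorithm of [13]; it is only feasible because the example is small. The variant values, the helper names `dominates` and `pareto_front`, and the assumption that area and time of a design point are the sums over its basic blocks are all illustrative.

```python
from itertools import product


def dominates(p, q):
    """Return True if p dominates q when all objectives are minimized:
    p is no worse in every objective and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(p, q))
    better = any(a < b for a, b in zip(p, q))
    return no_worse and better


def pareto_front(points):
    """Keep the points that are not dominated by any other point."""
    return sorted(
        p for p in set(points)
        if not any(dominates(q, p) for q in points)
    )


# Hypothetical implementation types I(BB_i) as (area, time) pairs.
bb_variants = [
    [(100, 32), (200, 16), (400, 8)],  # basic block 1
    [(50, 10), (150, 6)],              # basic block 2
]

# One design point picks exactly one variant per basic block.
# Assumption: area and time of a design point are sums over the blocks.
design_points = [
    (sum(a for a, _ in combo), sum(t for _, t in combo))
    for combo in product(*bb_variants)
]

front = pareto_front(design_points)
print(len(design_points), "design points,", len(front), "on the Pareto front")
print(front)
```

The example has $3 \cdot 2 = 6$ design points, in line with the product formula for $DP$, and five of them are Pareto-optimal: the point (250, 38) is dominated by (250, 26).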
IV. SCHEDULING PROBLEM
A set of tasks $T = \{T_1, T_2, \ldots, T_n\}$ is considered that has to be scheduled on a reconfigurable device. For [...] Here, one entry of the set corresponds to the sum of cycle counts that are needed by the task variants that are scheduled to a slot. In addition to this optimization, the constraint that for each task exactly one task variant has to be scheduled must [...] (6), and $V$ constraints (5) to be fulfilled. Hence, it is of paramount importance to limit the number of task variants and slots in order to keep the complexity of this problem controllable.
V. SCHEDULING ALGORITHM
The algorithm starts with the determination of slots and task variants. We [...] in order to cause an early violation of the constraint (5), thus improving the performance of the algorithm. Cutting of the tree can be performed due to several stopping criteria:
• Violation of the constraints (6) and (5) (Figure 5).
• Current cycle count is less than an already determined optimum (Figure 5, Label (a)).
• While traversing the tree, a remaining cycle count $CC_{rm}$ and the cycle count that is needed to schedule the remaining tasks $CC_{rmt}$ (time-minimal implementation) are maintained. If $CC_{rmt} < CC_{rm}$, then cutting of the tree is performed (Figure 5, Label (a)).

For example, see Figure 6. It is obvious that not the full range of design alternatives for each task has to be considered, but only the task variants for each task that are closest to the determined slot sizes $S_{max}$, $S_{min}$, and $S_{av}$, for example the three task variants that are chosen for task $T_2$ in Figure 4.

The next step in the algorithm is the enumeration of solutions. The generation of solutions can be accomplished with a depth-first search of a decision tree (Figure 5). In this decision tree each level corresponds to one decision variable, thus the depth of the tree equals the overall number of task variants $V$ that have to be scheduled. The order of the decision variables in the tree is chosen such that tasks with fewer variants are at a higher position than others.

As already mentioned, the complexity of this scheduling problem (size of the decision tree) increases exponentially with the number of decision variables. Therefore, the aforementioned stopping criteria are extended to a heuristic. A [...]

[...] Figure 7, due to the fact that the number of decision variables in this algorithm grows with the square of the overall number of design variants $V$. Thus, the optimization of a task set with $V = 20$ already exceeds several hours of computation time on a standard PC, whereas the execution time of the branch and cut algorithm stays below one hour. A further decrease of run time is achieved with the heuristic approaches. The optimization performance of the algorithms is depicted in Figure 8. For the task sets ts1 to ts7 all algorithms achieve equal minimization results.
For the task sets ts7 to ts12 the algorithm hbranchcut0.8 deviates only up to 6% from the optimization results that have been obtained with branchcut, whereas hbranchcut0.8 already shows deviations of up to 20%.
As mentioned earlier, the decision variables within the tree are ordered such that decision variables that correspond to tasks with fewer variants come first. A comparison between ordered and unordered decision variables has been performed, where a performance gain of up to 20% has been observed due to the fact that the constraint (5) is violated earlier.
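The depth-first enumeration with ordered decision variables can be sketched in Python as follows. This is a generic branch-and-bound illustration under stated assumptions, not the exact algorithm or stopping criteria of this paper: each task receives exactly one variant in exactly one slot, a variant fits a slot if its area does not exceed the slot area, tasks in the same slot run sequentially, and the objective is the largest per-slot sum of cycle counts. A branch is cut as soon as its partial cost cannot improve on the best complete schedule found so far. The function name `schedule` and all numbers are hypothetical.

```python
def schedule(tasks, slot_areas):
    """Depth-first branch-and-bound over (variant, slot) choices.

    tasks:      dict mapping task name -> list of (area, cycles) variants
    slot_areas: list of slot areas (at least one slot)
    Returns (best_cost, assignment); assignment maps each task name to
    (variant_index, slot_index), or is None if no feasible schedule exists.
    """
    # Tasks with fewer variants are decided first (higher in the tree).
    order = sorted(tasks, key=lambda name: len(tasks[name]))
    best = {"cost": float("inf"), "assignment": None}
    loads = [0] * len(slot_areas)  # summed cycle count per slot
    chosen = {}

    def dfs(depth):
        current = max(loads)
        if current >= best["cost"]:
            return  # cut: this branch cannot beat the best known schedule
        if depth == len(order):
            best["cost"] = current
            best["assignment"] = dict(chosen)
            return
        name = order[depth]
        for v_idx, (area, cycles) in enumerate(tasks[name]):
            for s_idx, slot_area in enumerate(slot_areas):
                if area > slot_area:
                    continue  # assumed area constraint: variant must fit
                loads[s_idx] += cycles
                chosen[name] = (v_idx, s_idx)
                dfs(depth + 1)
                loads[s_idx] -= cycles
                del chosen[name]

    dfs(0)
    return best["cost"], best["assignment"]


# Hypothetical task set with (area, cycles) variants and two slots.
tasks = {
    "T1": [(100, 32), (200, 16)],
    "T2": [(100, 20), (200, 10), (400, 5)],
    "T3": [(200, 12)],
}
cost, assignment = schedule(tasks, slot_areas=[200, 400])
print(cost, assignment)
```

In this example the search decides T3 first, then T1, then T2, and ends with a cost of 17 cycles: T1 uses its faster variant in the small slot, while T3 and the fastest variant of T2 share the large slot. Sorting by variant count keeps the upper tree levels narrow, which is the effect the ordering comparison above measures.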
VII. CONCLUSIONS
The run-time reconfiguration capability of FPGAs makes it possible to use them as multitasking devices. The reconfigurable device can be utilized even more efficiently by offering several design alternatives. Such trade-offs between time and area can be effectively generated by evolutionary algorithms. Nevertheless, it is computationally infeasible to consider all theoretically achievable design points in an exhaustive search for the schedule. A scheduling algorithm has been presented that reduces the number of design alternatives that are used for the scheduling. A depth-first search algorithm is applied that constructs solutions in feasible time compared to a classical level strip packing formulation, with comparable performance. With the extension to a heuristic algorithm, execution time is further reduced to several seconds.
