Abstract-This work proposes an exact ILP formulation for the task scheduling problem on a 2D dynamically and partially reconfigurable architecture. Our approach takes physical constraints of the target device that are relevant for reconfiguration into account. Specifically, we consider the limited number of reconfigurators, which are used to reconfigure the device. This work also proposes a reconfiguration-aware heuristic scheduler, which exploits configuration prefetching, module reuse, and antifragmentation techniques. We experimented with a system employing two reconfigurators. This system can be easily implemented using standard FPGAs. Our proposed ILP model can lead to an overall improvement close to 30% compared to other approaches in literature while the heuristic scheduler obtains the optimal schedule length on 60% of the considered instances.
I. INTRODUCTION
Reconfigurable hardware systems, FPGAs in particular, are receiving significant attention. At first they were employed as a cheap means of prototyping and testing hardware solutions, while nowadays it is not uncommon to deploy FPGA-based solutions directly. In this scenario, which can be termed Compile Time Reconfiguration [1], the configuration of the FPGA is loaded at the end of the design phase and remains the same throughout the whole application runtime. With the evolution of technology, it became possible to reconfigure the FPGA between different stages of its computation, since the induced time overhead could be considered acceptable. This process is called Run Time Reconfiguration (RTR) [1]. RTR is exploited by creating what has been termed virtual hardware [2], [3], following the concept of virtual memory in general-purpose computers. When an application bigger than the available FPGA area has to be executed, it can be partitioned into m partitions that each fit in that area; these are executed in numerical order, from 1 to m, to obtain the correct result. This idea is called time partitioning and has been studied extensively in the literature (see [4], [5]). A further improvement in FPGA technology allows novel devices to reconfigure only a portion of their own area, leaving the rest unchanged. This can be done using partial reconfiguration bitstreams. The partial reconfiguration time depends on the FPGA logic that needs to be changed. When both these features are available, the FPGA is called partially dynamically reconfigurable.

S. Ogrenci Memik is with Northwestern University, Evanston, Illinois (email: seda@eecs.northwestern.edu).
This work provides the following contributions:
• an ILP formulation for the problem of minimizing the schedule length of a task graph on a 2D partially dynamically reconfigurable architecture, which yields optimal schedules;
• a heuristic scheduler which takes into consideration anti-fragmentation techniques for general task graphs, a mix between classical deconfiguration policies and anti-fragmentation ones, and the use of out-of-order scheduling to better exploit module reuse.

Only a few exact models for the task scheduling problem on dynamically reconfigurable systems exist in the literature [6], [7]. The former was the first to model the communication between reconfigurable hardware components, by augmenting the execution time of a task to represent the communication delay necessary to move its results into the memory at the end of the computation. On the other hand, this model has the following limitations: it assumes that any number of reconfigurations can be performed simultaneously; it does not consider configuration prefetching, because the reconfiguration delay is simply added to the task execution time; and it requires the reconfiguration of a task to start only after every predecessor task has terminated. In [7], the authors present an ILP model that overcomes these limitations, but their target architecture employs one-dimensional reconfiguration with a single reconfigurator. This architecture is very simple and limited compared to the capabilities of current reconfigurable architectures.
Most of the heuristic approaches proposed in the literature consider an online scheduler that, starting from a task graph [8] or a set of independent tasks [9], obtains a schedule and a mapping of each task onto the FPGA. These approaches have been developed for runtime environments; furthermore, module reuse is considered only in a few cases. The approach proposed in this work is different, because the heuristic scheduler provides a baseline schedule that can be easily used and modified by a simple online scheduler to address runtime constraints. This paper focuses on the scheduling of tasks on partially dynamically reconfigurable FPGAs in order to minimize the overall latency of the application. Section II describes the problem and the architectural solution on which the proposed model is based. Section III describes the proposed ILP formulation and Section IV the heuristic scheduler. Section V presents a set of experimental results comparing the ILP results and the heuristic ones to the results of the models presented in [6] and [7]. Finally, the conclusions regarding the proposed approach are drawn in Section VI.
II. PROBLEM DESCRIPTION

A. Architecture description
Modern FPGA devices offer powerful reconfiguration features. First, it is possible to perform dynamic reconfiguration. Second, emerging technologies allow 2D reconfiguration, increasing the designer's degrees of freedom. The price of this freedom is the need for new tools capable of exploiting these features effectively. 2D partial dynamic reconfiguration may lead both to better solutions for well-known problems and to feasible solutions for new problems. Figure 1 shows the differences between 1D and 2D reconfiguration. In a 1D scenario a module occupying only a portion of a column requires the reconfiguration of the entire column, while in a 2D scenario the same module can be reconfigured by rewriting a much smaller area. When a portion of the FPGA has to be reconfigured, a specific file called a bitstream is needed: this file contains the information describing the next behavior of that portion of the FPGA. Bitstreams are tied to the operation they implement: once the bitstream is defined, the operation is defined too, whereas a given operation may be implemented by more than one bitstream. Therefore it is possible to assign to each bitstream an attribute called type, which identifies the operation implemented, the area occupied on the target architecture, and the time needed to configure and to execute that bitstream. The latest FPGA technology, such as the Xilinx V4 [10], [11] and V5 [12], [13] families, allows 2D partial dynamic reconfiguration. At the same time, the complexity of the problem of minimizing the schedule length of an application by exploiting reconfiguration increases. Furthermore, thanks to multiple reconfigurator devices, concurrent reconfigurations can be performed, that is, different modules can be configured simultaneously onto the FPGA.
Let us define the reconfiguration features that have to be taken into account when defining the schedule. Module reuse means that two tasks of the same type can be executed on exactly the same module on board, with a single configuration at the beginning. The deconfiguration policy is a set of rules used to decide when and how to remove a module from the FPGA. Anti-fragmentation techniques avoid fragmenting the available space on board by trying to maximize the size of free connected areas. Configuration prefetching means that a module is loaded onto the FPGA as soon as possible in order to hide its reconfiguration time as much as possible.
B. Formal problem description
The 2D reconfigurable device is modeled as a grid of reconfigurable units (RUs), whose rows and columns are represented by the two sets $R = \{r_1, r_2, \ldots, r_{|R|}\}$ and $C = \{c_1, c_2, \ldots, c_{|C|}\}$: each cell, represented by a pair $(r, c)$ with $r \in R$ and $c \in C$, is made up of $\rho_u$ CLBs. Columns and rows are linearly ordered, by which we mean that $r_k$ is adjacent to $r_{k \pm 1}$ on the FPGA for every $1 < k < |R|$; the same property holds for columns. The application is provided as a task graph $\langle S, P \rangle$, which is a Directed Acyclic Graph (DAG). The tasks can be physically implemented on the target device using a set $E$ of execution units (EUs), which correspond to different configurations of the resources (RUs) available on the device, and therefore to different bitstreams. In such a scenario, the reconfigurable scheduling problem amounts to scheduling both the reconfiguration and the execution of each task according to a number of precedence and resource constraints. Resource sharing occurs at EU level: different tasks may exploit the same RUs if they are executed in disjoint time intervals. Moreover, when they also share the same EU, they can be executed consecutively with a single reconfiguration at the beginning. Given any task $s$ and any of its feasible EU implementations $i$, we assume that suitable algorithms exist to readily compute the latency $l_{i,s}$, the size $r_i$ and the reconfiguration time $d_i$. Therefore it is possible to define a function that specifies for each task:
• the EU on which it has to be executed;
• the position on the FPGA where to place the selected EU;
• the reconfiguration start time for the selected EU, or the reuse of an already configured module when this is possible;
• the execution start time for the task.

In this work the general problem has been simplified: for each task type there is only one available EU. In this way the problem does not lose much in generality, but becomes easier to solve. Since each EU is associated with a bitstream, this simplification implies that the model works with a number of bitstream types equal to the number of task types.
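The four decisions listed above can be pictured as one record per task. The following is a minimal sketch in Python; the class name, field names, and the consistency check are illustrative assumptions, not part of the original model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScheduleEntry:
    """Decisions taken for one task (all names here are hypothetical)."""
    task: str
    eu_type: str                  # EU (bitstream type) implementing the task
    position: Tuple[int, int]     # anchor cell: leftmost, bottommost RU
    reconf_start: Optional[int]   # None means an existing module is reused
    exec_start: int

    def is_consistent(self, reconf_time: int) -> bool:
        """Execution may start only once the reconfiguration has finished."""
        if self.reconf_start is None:  # module reuse: nothing to wait for
            return True
        return self.exec_start >= self.reconf_start + reconf_time
```

A gap between `reconf_start + reconf_time` and `exec_start` corresponds to configuration prefetching: the module is already on the device while the task waits for its predecessors.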
III. ILP FORMULATION
We consider a 2D reconfiguration scenario: the sets $C$ and $R$ of RUs are respectively the set of columns and the set of rows of the FPGA. Therefore, all RUs have the same $\rho_u$ (conventionally, $\rho_u = 1$). Each task must be assigned to a rectangular set of RUs and, since multiple reconfigurator devices may be available, multiple concurrent reconfigurations may be exploited. We consider the following model.
A. Constants
• $a_{ij}$ := 1 if tasks $i$ and $j \in S$ are mapped onto the same EU, 0 otherwise (by convention, $a_{ii} = 1$);
• $l_i$ := latency of task $i \in S$. This amount of time is augmented to model also the latency necessary to store the results in memory;
• $d_i$ := time needed to reconfigure task $i \in S$;
• $c_i$ := number of RU columns required by task $i \in S$;
• $r_i$ := number of RU rows required by task $i \in S$;
• $N_{REC}$ := number of reconfigurator devices available for the FPGA.

The scheduling time horizon $T = \{0, \ldots, |T|\}$ is large enough to reconfigure and execute all tasks. A good estimate of $|T|$ may be obtained via a heuristic.
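As an illustration of how the horizon can be bounded, the sketch below computes the length of a trivial serial schedule. It assumes that every task fits on the empty FPGA by itself and that tasks are configured and run one after another in a topological order, with no reuse and no prefetching. The function name and the dictionary-based inputs are hypothetical. A heuristic schedule generally gives a much tighter value.

```python
def horizon_upper_bound(latency, reconf_time):
    """Length of a schedule that configures and then runs one task at a time.

    latency and reconf_time map each task to l_i and d_i respectively.
    Any feasible serial schedule fits in this many time steps, so the
    optimum cannot be longer.
    """
    return sum(reconf_time[task] + latency[task] for task in latency)
```

If the first time step is reserved to keep the device initially empty, one extra step has to be added to this value.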
B. Variables
• $p_{ihkm}$ := 1 if task $i$ is present on the FPGA at time $h$ and cell $(k, m)$ is the leftmost and bottommost cell used by $i$, 0 otherwise;
D. Constraints
We used the if-then transformation (see [14]) to model the constraints marked with a *.
NO-OVERLAP CONSTRAINTS A cell can be used by a single task at a time:
SINGLE CELL CONSTRAINTS A task cannot be present on the FPGA with different leftmost and bottommost cells:
CELL-ON-THE-RIGHT-AND-TOP CONSTRAINTS The leftmost column of task i cannot be one of the last c i − 1 columns; the same constraint has to be assumed for the last r i − 1 rows:
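For illustration, the three placement constraints above can be written using only the variables $p_{ihkm}$ and the constants $c_i$ and $r_i$. The following is one possible way to state them, under the assumption that columns and rows are indexed from 1 and that $k$ ranges over columns and $m$ over rows; it is a sketch and not necessarily the formulation used in the original model.

```latex
% No overlap: cell (k', m') at time h is covered by at most one task.
% A task anchored at (k, m) covers columns k .. k + c_i - 1 and rows m .. m + r_i - 1.
\sum_{i \in S} \;
\sum_{k = \max(1,\, k' - c_i + 1)}^{k'} \;
\sum_{m = \max(1,\, m' - r_i + 1)}^{m'} p_{ihkm} \le 1
\qquad \forall h \in T,\; k' \in C,\; m' \in R

% Single cell: a task never appears with two different anchor cells.
p_{ihkm} + p_{ih'k'm'} \le 1
\qquad \forall i \in S,\; h, h' \in T,\; (k, m) \ne (k', m')

% Right and top border: the rectangle of task i must fit inside the grid.
p_{ihkm} = 0
\qquad \forall i \in S,\; h \in T,\; k > |C| - c_i + 1 \ \text{or}\ m > |R| - r_i + 1
```

The pairwise form of the single-cell constraint is easy to read but produces many rows; an aggregated version would be preferable in an actual solver model.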
ARRIVAL TIME CONSTRAINTS The arrival time must not exceed the first time step for which p is 1:
LEAVING TIME CONSTRAINTS The leaving time must not precede the last instant for which p is 1:
A task is present on the FPGA in all time steps between the arrival and leaving time:
The leaving time constraint ensures that the leaving time is greater than or equal to the last time step in which the task is present. To ensure that a task occupies the FPGA during a single contiguous interval, the difference between the leaving time and the arrival time must equal the number of time steps in which the task is present, that is, $\sum_{h \in T} \sum_{k \in C} \sum_{m \in R} p_{ihkm}$. Since (3) ensures a single position, this constraint ensures that a task cannot be placed, removed, and then placed again:
PRECEDENCE CONSTRAINTS Precedences must be respected:
TASK LENGTH CONSTRAINTS A task must be present on the FPGA, for a total of $\sum_{h \in T} \sum_{k \in C} \sum_{m \in R} p_{ihkm}$ time steps, at least for its execution time plus (if no module reuse occurs) its reconfiguration time. Configuration prefetching is allowed, that is, the execution is not bound to follow the reconfiguration immediately: *
RECONFIGURATION START CONSTRAINTS Each task has a single reconfiguration start time or none (if it exploits module reuse):
Reconfiguration starts as soon as the task is on the FPGA:
RECONFIGURATION OVERLAP CONSTRAINTS At most N REC reconfigurations can take place simultaneously:
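The meaning of this constraint can be checked on a candidate schedule with a few lines of Python. The sketch below counts how many reconfigurations are in progress at the same time; a schedule respects the constraint when the returned value does not exceed the number of reconfigurators. The function name and the dictionary-based inputs are assumptions made for illustration only.

```python
def max_concurrent_reconfigurations(reconf_start, reconf_time):
    """Largest number of reconfigurations in progress during any time step.

    reconf_start maps each task to its reconfiguration start step, or to
    None when the task reuses a module. A reconfiguration that starts at
    step s and lasts d steps occupies steps s .. s + d - 1.
    """
    events = []
    for task, start in reconf_start.items():
        if start is None:
            continue
        events.append((start, 1))                       # reconfigurator taken
        events.append((start + reconf_time[task], -1))  # reconfigurator released
    # At equal times -1 sorts before +1, so a reconfigurator released at
    # step t can be taken again at step t.
    events.sort()
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak
```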
STARTING TIME CONSTRAINTS The starting instant is reserved, so that the FPGA is initially empty (see also (14)):
MODULE REUSE CONSTRAINTS A task can exploit module reuse only if in the time step preceding its arrival time an equivalent task occupies the same position:
DEFINITION OF THE OVERALL LATENCY
IV. NAPOLEON: A HEURISTIC APPROACH
Napoleon is a reconfiguration-aware scheduler for 2D dynamically partially reconfigurable architectures. It is characterized by the exploitation of configuration prefetching, module reuse and anti-fragmentation techniques. Algorithm 1 shows the pseudocode of Napoleon. First, it performs an infinite-resource scheduling in order to sort the task set S by increasing ALAP values. Then, it builds the subset RN of all tasks having no predecessors. In the following, RN is updated to include all tasks whose predecessors have all been scheduled already (available tasks). As long as the dummy end task S e is unscheduled, the algorithm performs the following operations. First, it scans the available tasks in increasing ALAP order to determine those which can reuse the modules currently placed on the FPGA. Each time this occurs, task S is placed in the position (k, m) which hosts a compatible module and is the farthest from the center of the FPGA. Unused modules can be present on the FPGA because Napoleon adopts limited deconfiguration as an anti-fragmentation technique: all modules are left on the FPGA until other tasks require their space, in order to increase the probability of reuse. The farthest placement criterion is also an anti-fragmentation technique that aims at favoring future placements, as it is usually easier to place large modules in the center of the FPGA [15]. The execution starting time is tentatively set to the current time step t, but it is postponed if any predecessor has not yet terminated (see Algorithm 2 with Reuse = true). The task is also moved from the available tasks to the just scheduled tasks (subset SN). When no further reuse is possible, Napoleon scans the available tasks in increasing ALAP order to determine those which can be placed on the FPGA in the current time step. The placement is feasible when sufficient space is currently free, or can be freed by removing an unused module, and when a reconfigurator is available.
If this occurs, the position for task S is chosen once again by the farthest placement criterion. The reconfiguration starting time is set to the current time step t, and the execution starting time is first set to t + d S and then possibly postponed to guarantee that all the predecessors of S have terminated (see Algorithm 2 with Reuse = false). Thus, there might be an interval between the end of the reconfiguration and the beginning of the execution of a task (configuration prefetching). When all possible tasks have been scheduled, the set of available tasks RN is updated: Algorithm 3 does that by scanning the successors of the tasks in SN, which have just been scheduled, and determining the ones which must be added to RN. Finally, the current time step is updated by replacing it with the first time step in which a reconfigurator is available.

Algorithm 1 Algorithm Napoleon(S, P)

Algorithm 3 Function newAvailableNodes(SN)
  RN ← ∅
  for all S ∈ SN do
    for all S' ∈ successors(S) do
      if predecessors(S') are all scheduled then
        RN ← RN ∪ {S'}
      end if
    end for
  end for
  return RN

Algorithm 1 does not report two optimizations that increase efficiency: if all configured modules are in use in the current time step, reuse is not possible and the first for loop can be skipped; if there is not enough available area to place any task, no new placement is possible and the second for loop can be skipped.
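Two of the rules described in this section, the farthest placement criterion and the choice of the execution starting time, can be sketched in Python as follows. The distance metric (squared Euclidean distance between the module center and the FPGA center), the tie-breaking rule, and all function and parameter names are assumptions, since the text does not specify them.

```python
def farthest_position(candidates, n_cols, n_rows, width, height):
    """Pick the anchor whose module center is farthest from the FPGA center.

    candidates: iterable of 0-based (column, row) anchors, i.e. the leftmost
    and bottommost cell of each feasible placement. Ties are broken in favor
    of the smallest (column, row) pair.
    """
    center_x, center_y = n_cols / 2, n_rows / 2

    def squared_distance(position):
        column, row = position
        return ((column + width / 2 - center_x) ** 2
                + (row + height / 2 - center_y) ** 2)

    # max() keeps the first maximal element, so sorting fixes the tie-break.
    return max(sorted(candidates), key=squared_distance)


def execution_start(t, reconf_time, predecessor_ends, reuse):
    """Start after the reconfiguration (if any) and after every predecessor."""
    earliest = t if reuse else t + reconf_time
    return max([earliest, *predecessor_ends])
```

For example, with t = 4 and a reconfiguration time of 3, a task whose last predecessor ends at step 9 starts executing at step 9 although its module is ready at step 7; the two idle steps are the effect of configuration prefetching.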
V. EXPERIMENTAL RESULTS
A. ILP results
This section compares the optimal results obtained by the models proposed in [6] and in [7] to the results of the model described in Section III. The evaluation has been performed by scheduling ten task graphs of ten nodes on an FPGA with 5 columns and 5 rows. The instances considered have small task graphs and few columns and rows because the problem is NP-hard and the computation time grows rapidly. The task graphs have been generated in order to verify different behaviors of the models: task graphs with tasks of different types, high module reuse, and high reconfiguration time. The optimal schedule lengths are shown in Table I. The first and second columns report the results of the proposed model, with 2 or 1 reconfigurators respectively, which correspond to a realistic scenario. The third and fourth columns report the results of the model proposed in [7], once again with 2 or 1 reconfigurators. The former is marked by a * because the original model does not accommodate more than one reconfigurator and, hence, we have extended it to support multiple reconfigurators. The fifth column reports the results of the model proposed in [6], in which the number of reconfigurators is unlimited but the reconfiguration must immediately precede the execution and follow the end of the preceding tasks.

TABLE I
         Proposed   Proposed   [7]*      [7]       [6]
         NREC=2     NREC=1     NREC=2    NREC=1
Ten1     15         17         15        17        18
Ten2     22         22         22        22        33
Ten3     16         16         16        16        25
Ten4     14         15         16        17        25
Ten5     21         21         21        21        28
Ten6     19         20         21        22        23
Ten7     20         20         20        20        28
Ten8     22         24         23        24        29
Ten9     26         26         26        26        32
Ten10    23         23         23        23        34

Table I shows that increasing the number of reconfigurator devices can improve the schedule length. This improvement is not assured, because it is not always possible to hide the reconfiguration time completely. The model proposed in [7] is dominated by our proposed approach because it only allows 1D reconfiguration instead of 2D reconfiguration.
Dominated means that every solution that the model in [7] can find can also be found by our approach. Furthermore, our model can explore a bigger design space thanks to 2D reconfiguration. The Fekete model [6] can obtain worse results because it does not exploit module reuse and configuration prefetching, even though it can reconfigure as many tasks as it needs at the same time. This is an interesting aspect of our proposed model: by modeling all the physical features recently introduced in reconfigurable devices, better results can be obtained.
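The size of the gain over [6] can be recomputed from the first and last columns of Table I. The short script below aggregates the schedule lengths over the ten task graphs; aggregating the totals is an assumption about how the overall improvement is measured, and the variable and function names are illustrative.

```python
# Schedule lengths from Table I for Ten1 .. Ten10.
proposed_first_column = [15, 22, 16, 14, 21, 19, 20, 22, 26, 23]
model_6_last_column = [18, 33, 25, 25, 28, 23, 28, 29, 32, 34]


def aggregate_improvement(ours, baseline):
    """Relative reduction of the total schedule length with respect to a baseline."""
    return (sum(baseline) - sum(ours)) / sum(baseline)


improvement = aggregate_improvement(proposed_first_column, model_6_last_column)
print(f"{improvement:.0%}")
```

The totals are 198 and 275 time steps, which gives a reduction of 28%, in line with the improvement close to 30% stated in the abstract.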
B. Heuristic vs ILP
The ILP results show the gain achievable by a scheduler that takes into account all the features of 2D partially dynamically reconfigurable devices. In practice, however, it is impossible to rely completely on a schedule obtained by an ILP solver, because the solution time of the ILP solver can be too long. Hence, a problem-specific heuristic scheduler becomes necessary.

TABLE II
         NRECS=2     NRECS=1     NRECS=2    NRECS=1
         ILP model   ILP model   Napoleon   Napoleon
Ten1     15          17          15         18
Ten2     22          22          22         22
Ten3     16          16          16         16
Ten4     14          15          14         18
Ten5     21          21          21         21
Ten6     19          20          23         24
Ten7     20          22          22         25
Ten8     22          24          23         25
Ten9     26          26          26         26
Ten10    23          23          23         26

The execution times of the experiments shown in Table II are reported in Table III.

TABLE III
ILP AND HEURISTIC EXECUTION TIME
         NRECS=2     NRECS=1     NRECS=2    NRECS=1
         ILP model   ILP model   Napoleon   Napoleon
Ten1     27 days     26 days     546 ms     477 ms
Ten2     31 days     25 days     413 ms     398 ms
Ten3     30 days     26 days     566 ms     513 ms
Ten4     27 days     25 days     578 ms     544 ms
Ten5     27 days     28 days     456 ms     401 ms
Ten6     34 days     31 days     670 ms     555 ms
Ten7     25 days     25 days     590 ms     487 ms
Ten8     23 days     20 days     602 ms     599 ms
Ten9     39 days     29 days     489 ms     433 ms
Ten10    36 days     37 days     716 ms     673 ms

Table III shows that the heuristic is necessary in order to achieve results in a reasonable time. The heuristic has been tested on the same ten task graphs used above; the results are reported in Table II. In 7 out of 10 benchmarks Napoleon yields the optimal solution when there are two reconfigurators. Increasing the number of reconfigurator devices improves the results in two ways: on one hand, two reconfigurator devices may hide much more reconfiguration time, thus improving the optimum; on the other hand, the developed heuristic gets closer to the optimum.
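The number of instances solved to optimality by the heuristic can be recounted from Table II. The sketch below assumes that the four columns of Table II follow the same order as those of Table III (ILP with two and one reconfigurators, then Napoleon with two and one reconfigurators); the variable and function names are illustrative.

```python
# Table II rows: (ILP NRECS=2, ILP NRECS=1, Napoleon NRECS=2, Napoleon NRECS=1).
# The column order is an assumption based on the layout of Table III.
table_ii = {
    "Ten1": (15, 17, 15, 18), "Ten2": (22, 22, 22, 22),
    "Ten3": (16, 16, 16, 16), "Ten4": (14, 15, 14, 18),
    "Ten5": (21, 21, 21, 21), "Ten6": (19, 20, 23, 24),
    "Ten7": (20, 22, 22, 25), "Ten8": (22, 24, 23, 25),
    "Ten9": (26, 26, 26, 26), "Ten10": (23, 23, 23, 26),
}


def optimal_count(rows, ilp_column, heuristic_column):
    """Number of task graphs on which the heuristic matches the ILP optimum."""
    return sum(1 for row in rows.values()
               if row[heuristic_column] == row[ilp_column])


optimal_with_two = optimal_count(table_ii, 0, 2)
optimal_with_one = optimal_count(table_ii, 1, 3)
print(optimal_with_two, optimal_with_one)
```

Under this reading of the columns, Napoleon reaches the optimum on 7 task graphs with two reconfigurators and on 4 with a single one, which matches the observation that the heuristic benefits from the second reconfigurator.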
These schedulers have been tested and compared on the following applications, which are useful to extract features from a large set of data and have been selected from a popular data mining library, the NU-MineBench suite [16]:
1) variance application: it receives as input a single set of data and calculates the mean and the variance over the whole data set;
2) distance application: it receives as inputs two sets of data of equal size and calculates the distance between them;
3) variance1 application: it receives as input a single set of data and calculates the mean and the variance over the whole data set. Its task graph differs from that of the former variance application, since it involves different task types.
In order to compare the heuristic and the ILP model, task graphs with 7 nodes have been scheduled, one for each application (see Table IV). These are compute-intensive applications where there are few task types and many tasks available at the same time. The developed schedulers are effective in this case thanks to good management of the FPGA area, module reuse, and configuration prefetching.
VI. CONCLUSION
The main goal of this work was to introduce a formal model for the scheduling problem in a 2D partially dynamically reconfigurable scenario. The proposed model takes into account all the features available in a partial dynamic reconfiguration scenario. The results show that a reconfiguration-aware model can strongly improve the solution. The second goal of this work was to propose a heuristic reconfiguration-aware scheduler that obtains results close to the optimal ones in a much shorter time. In fact, an ILP solver takes a very long time to solve the problem exactly, while the heuristic algorithm reaches a good solution in a very short time. The results indicate that Napoleon can be used effectively as a baseline scheduler in an online scenario. The next step in this work is to develop an online scheduler that, starting from the results obtained by Napoleon, finds a feasible schedule and mapping at runtime.
