Modern use of FPGAs as hardware accelerators involves the partial reconfiguration of hardware resources as the application executes. In this paper, we present a polynomial-time algorithm for scheduling reconfiguration tasks, given a trace of actors (invocations of hardware kernels), that is both provably optimal and placement-aware. In addition, we propose a dependence analysis that determines, for each actor instance, whether a reconfiguration task is needed prior to its execution in hardware. A case study using an H.264 encoder is presented to compare our algorithm against state-of-the-art heuristics.
Introduction
One of the key challenges in achieving real speedups in FPGA-based reconfigurable architectures is that hardware reconfiguration of today's massive FPGAs can be very costly. The configuration cost needs to be amortized; otherwise, the benefits of hardware acceleration may be lost while the application waits for reconfiguration to complete. Configuration prefetching [7] seeks to address this problem by overlapping (partial) reconfiguration with the execution of the application in the FPGA. However, a prefetch miss is costly because of the additional reconfigurations that may be needed to recover from the miss. Therefore, the scheduling of reconfigurations is crucial.
In this context, this paper solves the following problem. Given a sequence (trace) of actors, where an actor is an invocation of a hardware module:
• Determine whether for a given actor in the trace, it is necessary to schedule a reconfiguration task before it.
• Compute the earliest possible time a required reconfiguration task may be scheduled. For the current technology, at most one reconfiguration task is typically allowed at any time.
In essence, we will present a polynomial-time (in terms of the length of the trace and the number of distinct hardware modules) algorithm that schedules all the required reconfigurations such that the overall execution time (latency) of the given actor trace is provably minimized. To the best of our knowledge, this is the first time an algorithm of this nature has been proposed.
Preliminaries

Architecture model
We consider an architecture with one micro-processor that receives a trace of actors. For each actor, we assume a corresponding hardware accelerator module that may be loaded into the FPGA for subsequent execution by means of partial hardware reconfiguration.
Scheduling model
Example 2.1 Figure 1 shows an example application consisting of a sequence of five actors (corresponding to four tasks) with data dependencies, together with a given conflict relation concerning the shared use of FPGA resources. For example, if task B conflicts with task C, they share some common hardware resources on the FPGA, which may be I/O pins, memory resources (such as block RAMs), or slices.
Figure 1. Example of actor trace
Assume for now that this conflict relation is given statically, i.e., no module relocation is allowed. Thus we know the conflicts between every pair of actors at compile time. More formally, we define an actor trace and the corresponding conflicts as follows:
1. Trace of actors: $S_a = (a_0, a_1, a_2, \dots, a_n)$ with $a_i \in T$, $i = 0, \dots, n$. $T$ is a set of tasks, and $|T| = N$, where $N$ is the number of tasks.
2. Resource conflicts: The relation $C = \{(T_i, T_j) \mid T_i \text{ conflicts with } T_j\}$ denotes that the placement of $T_i$ conflicts with the placement of $T_j$.
3. Any actor $a_i \in S_a$ can only be scheduled for execution on the FPGA if all its preceding tasks have completed execution. Furthermore, if the corresponding module is not in the FPGA, it needs to be loaded, i.e., the corresponding resources must be reconfigured, prior to execution.
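The trace and conflict model above can be written down in a few lines. The following is a minimal Python sketch, assuming that the conflict relation is symmetric (two modules that share resources conflict in both directions). The task names, the example trace, and the helper names `make_conflicts` and `in_conflict` are illustrative only and are not taken from Figure 1.

```python
def make_conflicts(pairs):
    """Build a symmetric conflict relation C from unordered task pairs."""
    return {frozenset(pair) for pair in pairs}


def in_conflict(conflicts, t, u):
    """Return True if the placements of tasks t and u overlap."""
    return t != u and frozenset((t, u)) in conflicts


# Hypothetical trace S_a of five actors over four tasks (T = {A, B, C, D}).
trace = ["A", "B", "B", "C", "D"]
conflicts = make_conflicts([("A", "B"), ("B", "C")])

print(in_conflict(conflicts, "B", "A"))  # lookup is order-independent
print(in_conflict(conflicts, "A", "C"))  # no conflict declared for this pair
```

Storing each pair as a `frozenset` keeps the lookup independent of argument order, which matches the assumption that a placement conflict is mutual.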
Precedence Analysis
Before defining the scheduling problem, we need to distinguish three different types of dependencies. The first one, data dependencies, is obvious. The second is the conflict relation introduced above, which is due to the sharing of FPGA resources among the hardware modules. Finally, the third kind of dependency arises because an actor cannot begin execution until its corresponding reconfiguration task has completed. In order to compute these dependencies, we first need to discuss the problem of reconfiguration task generation.
Generation of reconfiguration tasks
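Definition 3.1 (True dependence) Given a sequence of actors $S_a$, an actor $a_i$ is called truly dependent on $a_j$, written $a_j \prec a_i$, if $a_j$ is the conflicting predecessor of $a_i$ that is closest to $a_i$ in the trace and $a_i$ is the first actor of its task subsequent to $a_j$.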
True dependence is based on the intuition that, for an actor $a_i$ of task $t \in T$, not every occurrence of a conflicting predecessor in the trace matters. It is the conflicting predecessor $a_k$ closest to $a_i$ that has an impact on the reconfiguration decision for $a_i$. Furthermore, $a_i$ must be the first actor of task type $t$ in the trace subsequent to $a_k$.
Example 3.1 In Figure 1, $a_1$ is truly dependent on $a_0$, but $a_2$ has no true dependence because it executes after another actor, $a_1$, of the same task.
Now, each first appearance of a task in a trace will also necessitate exactly one reconfiguration task. Hence, the set of required reconfiguration tasks $S_r = (r_0, \dots, r_l)$ may be found by inspecting the given trace once.
Theorem 3.1 (Reconfiguration task instantiation) For an actor $a_i$ in a given trace $S_a$, there needs to be a corresponding reconfiguration actor (task) $r_i$ if, and only if, $\exists a_j \in S_a : a_j \prec a_i$, that is, if there exists a predecessor $a_j$ in $S_a$ on which $a_i$ is truly dependent.
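The intuition behind true dependence can be checked directly by scanning backwards from an actor. The sketch below is one possible reading of the definition, not the authors' implementation: it returns the index of the closest conflicting predecessor unless an actor of the same task is met first. The function name `true_dependence`, the trace, and the conflict pairs are hypothetical, and the conflict relation is assumed to be symmetric.

```python
def true_dependence(trace, conflict_pairs, i):
    """Return j such that a_j precedes a_i in the true-dependence sense, else None."""
    conflicts = {frozenset(pair) for pair in conflict_pairs}
    task = trace[i]
    for k in range(i - 1, -1, -1):
        if trace[k] == task:
            # An actor of the same task lies between any earlier conflicting
            # predecessor and a_i, so a_i is not the first actor of its task.
            return None
        if frozenset((trace[k], task)) in conflicts:
            # Closest conflicting predecessor with no same-task actor in between.
            return k
    return None


# Hypothetical trace and conflicts (not the ones of Figure 1).
trace = ["A", "B", "B", "C", "A"]
pairs = [("A", "B"), ("B", "C")]
deps = [true_dependence(trace, pairs, i) for i in range(len(trace))]
print(deps)
```

In this example the second `B` has no true dependence because it directly follows another actor of the same task, mirroring Example 3.1. An actor such as the first `A`, which has no conflicting predecessor at all, also yields `None` even though its first appearance still requires a reconfiguration.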
For each reconfiguration task $r_i$, two additional dependencies must be created. First, each $r_i$ must complete before the corresponding actor $a_i$ starts executing. Second, for $a_j$ such that $a_j \prec a_i$, the reconfiguration task $r_i$ cannot start earlier than the completion of $a_j$, because $r_i$ reconfigures resources that $a_j$ is still using. The two dependencies are represented by adding an outgoing edge from $r_i$ to $a_i$ and an incoming edge from $a_j$ to $r_i$.
Example 3.2 Figure 2 shows both the set of reconfiguration tasks generated for the running example introduced in Example 2.1 and the additional scheduling dependencies.
Figure 2. Dependence Relations
In summary, we have to consider the following three types of dependencies for scheduling after having all the required reconfiguration tasks generated:
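• Sequential precedence: $P_s$ orders the actors as they appear in the trace, so that $a_i$ cannot start before its predecessors in $S_a$ have completed.
• Reconfiguration precedence: $P_r$ contains an edge from $r_i$ to $a_i$ for every reconfiguration task $r_i \in S_r$.
• Conflict precedence: $P_c$ contains an edge from $a_j$ to $r_i$ whenever $a_j \prec a_i$.
The complete dependence relation is thus $P = P_s \cup P_r \cup P_c$.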
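As an illustration, the complete dependence relation can be assembled as a set of directed edges. The sketch below assumes that reconfiguration tasks are given as pairs `(i, j)`, meaning that $r_i$ serves actor $a_i$ and must wait for actor $a_j$ (with `j = -1` when there is no such actor). Which edge set carries the name $P_r$ and which $P_c$ is an assumption made here, as are the node encoding and the function name `dependence_relation`.

```python
def dependence_relation(n_actors, reconfigs):
    """Return P = P_s | P_r | P_c as a set of (source, target) edges.

    Nodes are encoded as ("a", i) for actors and ("r", i) for reconfigurations.
    """
    # P_s: consecutive actors of the trace execute in order.
    p_s = {(("a", i), ("a", i + 1)) for i in range(n_actors - 1)}
    # P_r (assumed name): r_i must complete before a_i starts.
    p_r = {(("r", i), ("a", i)) for i, _ in reconfigs}
    # P_c (assumed name): r_i may not start before a_j completes.
    p_c = {(("a", j), ("r", i)) for i, j in reconfigs if j >= 0}
    return p_s | p_r | p_c


# Hypothetical reconfiguration tasks for a five-actor trace.
reconfigs = [(0, -1), (1, 0), (3, 2), (4, 2)]
P = dependence_relation(5, reconfigs)
print(len(P))
```

Only the direct successor edges are stored for the sequential precedence; the remaining orderings follow by transitivity.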
Minimizing the schedule length
Given the above, we are now in a position to state the scheduling problem formally. The following notation will be used throughout the paper:
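• $l(a_i)$ and $l(r_i)$ denote the latency of actor $a_i$ and of reconfiguration task $r_i$, respectively.
A feasible schedule is an assignment of end times $f(a_i)$ and $f(r_i)$, respectively, to every actor $a_i \in S_a$ and reconfiguration task $r_i \in S_r$ such that all the above-mentioned precedence constraints are satisfied, i.e., for every pair $(x, y) \in P$, $f(y) - l(y) \geq f(x)$.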
The aim of a scheduling algorithm for this problem is to find a feasible schedule in which $f(a_n)$ is minimized for a trace of actors $S_a = (a_0, a_1, a_2, \dots, a_n)$.
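A candidate schedule can be checked against the precedence constraints mechanically. The following sketch assumes that feasibility means $f(y) - l(y) \geq f(x)$ for every edge $(x, y)$ of the dependence relation; this inequality is our reading of the constraint, and for preemptible reconfiguration tasks it is only a necessary condition. The check does not model the single reconfiguration port. All node names, latencies, and end times are hypothetical.

```python
def is_feasible(finish, latency, precedence):
    """Check the precedence constraints of a schedule given by end times."""
    # Nothing may start before time 0.
    starts_ok = all(finish[v] - latency[v] >= 0 for v in finish)
    # Every successor y must start no earlier than its predecessor x ends.
    order_ok = all(finish[y] - latency[y] >= finish[x] for x, y in precedence)
    return starts_ok and order_ok


latency = {"a0": 4, "r1": 2, "a1": 3}
precedence = {("a0", "a1"), ("a0", "r1"), ("r1", "a1")}

good = {"a0": 4, "r1": 6, "a1": 9}
bad = {"a0": 4, "r1": 5, "a1": 8}  # r1 would have to start before a0 ends

print(is_feasible(good, latency, precedence))
print(is_feasible(bad, latency, precedence))
```

The schedule length to be minimized is then simply the end time of the last actor, `finish["a1"]` in this small example.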
Algorithm 1. Modified List Scheduling (MLS), closing portion
    else
        length ← length + l(current_A);
        Time ← l(current_A);
        while H not empty ∧ Time ≠ 0 do
            r ← ExtractMax(H);
            if l(r) < Time then
                Time ← Time − l(r);
            else
                r.TimeRemaining ← r.TimeRemaining − Time;
                Time ← 0;
                Insert(H, r);
    else
        current_A ← empty;
length ← length + l(a_n);
return length;
We shall now present the main result of this paper, namely a polynomial time, latency-optimal scheduling algorithm for actors and reconfiguration tasks that we call Modified List Scheduling (MLS). The algorithm assumes that reconfiguration tasks can be pre-empted. This is based on the way frame-based reconfigurable devices operate. Configuration for frame-based devices such as Xilinx FPGAs is achieved by writing a set of frames into the SRAM configuration memory of the device. It does not matter whether the reconfiguration process is carried out in 1, 2, or more phases as long as the affected area is not again rewritten by other module configurations in between. Also, the algorithm prioritizes reconfiguration tasks by the order of appearance of their corresponding actors in the actor trace.
The MLS algorithm is shown in Algorithm 1. It consists mainly of two passes through the actor trace. In the first pass, the algorithm finds the true dependences between the actors and generates the corresponding reconfiguration tasks $S_r$. To do this, we maintain a flag $f_t$ and an index $prev_t$ for each task $t \in T$. We traverse the trace from $a_0$ to $a_n$. Assume that $a_i$ is the current actor. If the flag $f_{a_i}$ is true, a corresponding reconfiguration task $r_i$ is created, and if $prev_t \neq -1$, $r_i$ is to be preceded by actor $a_{prev_t}$ (i.e., $a_i$ is truly dependent on $a_{prev_t}$). $prev_t = -1$ when the reconfiguration task created is needed for the first occurrence of $a_i$. Furthermore, we record all ready reconfiguration tasks in a heap data structure $H$, ordered by the relative order of appearance of the associated actors in the actor trace. In order to perform preemptive scheduling of reconfiguration tasks, we maintain a TimeRemaining attribute for each of the tasks, initialized to the full reconfiguration latency required.
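The first pass can be sketched as a single traversal that keeps, per task, a "needs loading" state and the index of the latest conflicting actor. The code below is a minimal sketch under stated assumptions rather than the paper's Algorithm 1: the conflict relation is assumed symmetric, executing an actor is assumed to invalidate the modules of all conflicting tasks, and the index is set to -1 only when no conflicting actor has been seen before. The name `reconfiguration_tasks` and the example data are hypothetical.

```python
from collections import defaultdict


def reconfiguration_tasks(trace, conflict_pairs):
    """Return a list of (i, j): r_i is needed for a_i and must wait for a_j (-1 if none)."""
    neighbours = defaultdict(set)
    for t, u in conflict_pairs:
        neighbours[t].add(u)
        neighbours[u].add(t)

    loaded = set()  # tasks whose module is currently configured (flag is false)
    prev = {}       # task -> index of the latest conflicting actor seen so far
    tasks = []
    for i, t in enumerate(trace):
        if t not in loaded:
            tasks.append((i, prev.get(t, -1)))
            loaded.add(t)
        # Running t occupies resources shared with every conflicting task.
        for u in neighbours[t]:
            loaded.discard(u)
            prev[u] = i
    return tasks


trace = ["A", "B", "B", "C", "A"]
pairs = [("A", "B"), ("B", "C")]
print(reconfiguration_tasks(trace, pairs))
```

Each actor is visited once and touches only the tasks it conflicts with, so the pass is linear in the trace length for a bounded number of conflicts per task.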
The second pass through the trace computes the actual schedule using preemptive scheduling of reconfiguration tasks. $current_A$ is the current ready actor. If there is no ready actor, we schedule the ready reconfiguration task $r$ whose associated actor appears earliest in the actor trace. Otherwise, we schedule actor $current_A$. During the time $l(current_A)$, we schedule as many reconfiguration tasks sequentially as possible, so that the FPGA is configured in parallel with the execution of $current_A$. However, the execution time of the scheduled actor may be too short to cover the TimeRemaining of $r$. Such $r$'s are inserted back into $H$ with updated TimeRemaining. The algorithm terminates when the last actor $a_n$ is scheduled.
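The second pass can be illustrated with a small simulation. This is a sketch of our reading of the description above, not a reproduction of Algorithm 1: it assumes positive latencies, takes the reconfiguration tasks as pairs `(i, j)` (task $r_i$ for actor $a_i$, released once actor $a_j$ completes, or at time 0 when `j = -1`), and uses Python's `heapq` min-heap keyed on the actor index in place of the priority heap $H$. The function name `mls_length` and all latencies are hypothetical.

```python
import heapq


def mls_length(trace, reconfigs, exec_lat, reconf_lat):
    """Return the schedule length with preemptive prefetching of reconfigurations."""
    remaining = {i: reconf_lat[trace[i]] for i, _ in reconfigs}  # TimeRemaining
    released_by = {}  # actor index -> reconfigurations released when it completes
    ready = []        # min-heap of ready reconfigurations, earliest actor first
    for i, j in reconfigs:
        if j < 0:
            heapq.heappush(ready, i)
        else:
            released_by.setdefault(j, []).append(i)

    length = 0
    for i, task in enumerate(trace):
        if remaining.get(i, 0) > 0:
            # No actor is ready: finish r_i, which heads the heap by construction.
            head = heapq.heappop(ready)
            length += remaining[head]
            remaining[head] = 0
        budget = exec_lat[task]  # time available for overlapped reconfiguration
        length += budget
        while ready and budget > 0:
            r = heapq.heappop(ready)
            if remaining[r] <= budget:
                budget -= remaining[r]
                remaining[r] = 0
            else:
                remaining[r] -= budget  # preempted; resume later
                budget = 0
                heapq.heappush(ready, r)
        for r in released_by.get(i, []):
            heapq.heappush(ready, r)
    return length


trace = ["A", "B", "B", "C", "A"]
reconfigs = [(0, -1), (1, 0), (3, 2), (4, 2)]
exec_lat = {"A": 4, "B": 3, "C": 5}
reconf_lat = {"A": 2, "B": 2, "C": 6}
print(mls_length(trace, reconfigs, exec_lat, reconf_lat))
```

In this example the last reconfiguration is hidden behind the execution of the actor of task `C`, so the total is shorter than the sum of all execution and reconfiguration latencies.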
Case Study
H264-encoder case study
We use an H.264 [10] encoder application as a case study of the effectiveness of our algorithm. Based on profiling, we identified 15 loops that take up most of the computation time in the application. The hardware implementations of these loops were synthesized using Xilinx's ISE. Table 1 shows the characteristics of the application using two actor traces obtained with the 15 loops. It shows the length of the actor traces and the number of unique patterns occurring within each trace. A pattern is a maximal acyclic sequence of actors that occurs repeatedly in the trace. Two patterns are considered different if they differ in at least one actor. In the shorter trace, we encode one frame, while in the longer trace we encode two consecutive frames. The frames are 704 by 576 pixels in size. All the hardware modules are assumed to be running at a frequency of 50 MHz.
Experiment Setup
To demonstrate the effectiveness of our approach, we compared it against three algorithms: two different online Least Mean Square predictors and a simple scheduler.
Simple Scheduler
Instead of prefetching, the Simple Scheduler maintains a record of the current FPGA configuration and only schedules a reconfiguration on demand if the actor to be executed is not yet in the FPGA. It is reasonable to expect that any prefetching approach should do no worse than the Simple Scheduler. We therefore used the schedule length computed by the Simple Scheduler as the baseline for our comparisons.
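The on-demand behaviour of the Simple Scheduler can be sketched as follows. This is a minimal illustration that assumes a symmetric conflict relation and that loading a module evicts every module in conflict with it; the function name `simple_scheduler_length` and the latencies are hypothetical.

```python
from collections import defaultdict


def simple_scheduler_length(trace, conflict_pairs, exec_lat, reconf_lat):
    """Return the schedule length when modules are loaded only on demand."""
    neighbours = defaultdict(set)
    for t, u in conflict_pairs:
        neighbours[t].add(u)
        neighbours[u].add(t)

    loaded = set()  # record of the current FPGA configuration
    length = 0
    for task in trace:
        if task not in loaded:
            # The actor waits for its reconfiguration: no overlap with execution.
            length += reconf_lat[task]
            loaded -= neighbours[task]
            loaded.add(task)
        length += exec_lat[task]
    return length


trace = ["A", "B", "B", "C", "A"]
pairs = [("A", "B"), ("B", "C")]
exec_lat = {"A": 4, "B": 3, "C": 5}
reconf_lat = {"A": 2, "B": 2, "C": 6}
print(simple_scheduler_length(trace, pairs, exec_lat, reconf_lat))
```

Because every reconfiguration sits on the critical path, the result equals the sum of all execution latencies plus the latencies of all reconfigurations performed.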
Least Mean Square Online Predictor A (LMSA-a)
This is an online predictor similar to that described in [7, 2]. The Least Mean Square filter is used as the predictor function. However, because the target FPGA architecture considered in our paper is different (their architecture [3] supports relocation and defragmentation), our approach does not use the priority function that is based on the configuration sizes and the different eviction policies. Rather, the hardware modules evicted are those in conflict with the module currently being prefetched.
Least Mean Square Online Predictor B (LMSA-b)
This is a modification of LMSA-a. Instead of predicting the next hardware task, the algorithm predicts and attempts to prefetch the next task that conflicts in placement with the currently scheduled task.
Experimental Results
In order to show the effect of increasing configuration overhead on the schedule length, we ran experiments varying the reconfiguration time from 1 µs to 20 µs per CLB column. Figure 3 shows the performance increase of the different approaches over the schedule produced by the Simple Scheduler. The threshold of the minimum average execution cycles between two conflicting hardware modules is set to 1000 cycles for this experiment. We observe that as reconfiguration speed decreases, the performance gain achieved by all the approaches decreases. With a high reconfiguration overhead, execution simply has to wait until reconfiguration completes. The single reconfiguration port also becomes a bottleneck. Over the range of reconfiguration overheads we considered, the schedule produced by MLS outperforms the others in every case. In the best case, it is 30 percent better than those produced by the other schemes.
Related works
Configuration prefetching [5, 6, 7, 2, 4] is one of the techniques proposed to reduce the reconfiguration overhead. In [7], the author described a prefetching technique for partially reconfigurable FPGAs, exploiting the overlap between hardware execution and reconfiguration. In particular, a Markov predictor was introduced for deciding on the next reconfiguration operation. An extension of the work was presented in [2]. Morphosys [8] presented a heuristic context scheduling technique for its coarse-grained reconfigurable architecture.
In most FPGAs and partially reconfigurable FPGA-based platforms such as the Erlangen Slot Machine (ESM) [1, 9], the reconfiguration interface may be considered as just another resource. Hence, for applications that have static reconfiguration needs, resource-constrained scheduling techniques may be used to schedule FPGA resources and the reconfiguration interface simultaneously [4]. In this paper, we consider traces of actors, which are requests for hardware activations of complex tasks. Therefore, our work goes beyond earlier ones done mainly on simple unconditional data-flow graphs.
