This paper deals with scheduling periodic real-time tasks on reconfigurable hardware devices, such as FPGAs. Reconfigurable hardware devices are increasingly used in embedded systems. To use these devices in systems with real-time constraints as well, predictable task scheduling is required. We formalize the periodic task scheduling problem and propose two preemptive scheduling algorithms. The first is an adaptation of the well-known Earliest Deadline First (EDF) technique to the FPGA execution model. Although the algorithm shows good scheduling performance, it lacks an efficient schedulability test and requires a high number of FPGA configurations. The second algorithm uses the concept of servers that reserve area and execution time for other tasks. Tasks are successively merged into servers, which are then scheduled sequentially. While this method is inferior to the EDF-based technique regarding schedulability, it comes with a fast schedulability test and greatly reduces the number of required FPGA configurations.
0-7803-9362-7/05/$20.00 ©2005 IEEE
INTRODUCTION
Real-time systems are embedded computing systems that must react within precise time constraints to events from their environment. Example application domains, as reported in [1], include control of power plants, railway switching systems, automotive applications, flight control systems, robotics, telecommunication systems and many more. For most of these systems it is already common practice, or at least conceivable for the near future, to include reconfigurable hardware devices to implement computations.
Reconfigurable hardware devices, the most prominent one being the field-programmable gate array (FPGA), are general-purpose devices that can be programmed after fabrication. SRAM-based FPGA variants can be re-programmed arbitrarily often, opening up the way to FPGA-based multitasking. While real-time scheduling has been intensively studied for microprocessor-based systems [1, 2, 3], the investigated task scheduling and placement strategies for reconfigurable hardware devices have mostly focused on non-real-time application models [4, 5, 6, 7]. Most authors assume a 2-dimensional area model and partial reconfigurability, and treat tasks as relocatable rectangles which can be placed anywhere on the FPGA device. Placement and scheduling strategies in off-line and on-line application scenarios are considered, mostly optimizing cost functions such as the total makespan or the average response time. To the best of our knowledge, [8] is the only related work considering FPGA real-time scheduling. There, problems of non-preemptively scheduling aperiodic tasks to the 1- and 2-dimensional area models are treated.
* Supported by the DFG Research Training Group 776
The practical realization of multitasking on current FPGA technology raises several issues: First, partial reconfiguration is often limited in practice by device architectures and insufficient tool support. Some FPGA families are not partially reconfigurable at all. Second, the issue of communication between tasks is rarely considered in the models used. Finally, most related projects require tasks to be relocatable, which might be difficult to achieve for modern FPGA architectures that are not fully homogeneous.
Our work differs in that we use full FPGA reconfiguration and focus on preemptive periodic real-time scheduling. The full reconfiguration model can be used on all SRAM-based FPGAs and can be realized using standard design implementation tools. Task preemption requires a runtime system to be able to save the state of a task and, later on, resume it. Concepts and implementations of preemptive execution environments on FPGAs can be found in [7, 9].
The typical embedded reconfigurable target architecture is shown in Fig. 1 and comprises an FPGA, a controller, memory, and various I/O devices. Besides the embedded software and data sections, the memory stores the configurations (i.e., the programming bitstreams) for the logic resource. For such an architecture, we are interested in devising scheduling algorithms for periodic real-time tasks that meet the following objectives:
• high scheduling performance: We want to be able to generate feasible schedules for a wide range of task sets.
• efficient schedulability test: We want to quickly decide whether all tasks will meet their deadlines in a given schedule.
• small number of required FPGA configurations: We want to minimize the number of FPGA configurations which, in turn, minimizes the required amount of embedded memory.
In this paper, we present the formal modeling of the scheduling problem and two scheduling algorithms: EDF-NF and MSDL. EDF-NF is a straightforward adaptation of the EDF algorithm to our specific system model. While showing remarkable scheduling performance, EDF-NF lacks an efficient schedulability test and requires a prohibitively large number of FPGA configurations. MSDL comes with a test of acceptable efficiency and keeps the number of required configurations small, at the price of a decreased scheduling performance. The basic principles of these two algorithms have been published previously [10]. This paper extends our initial ideas and includes as novel contributions i) the detailed analysis of the MSDL algorithm and ii) simulation experiments delivering a quantitative evaluation of the required number of configurations for both scheduling techniques.
THE SCHEDULING PROBLEM

Task and Resource Models
We consider a set of periodic tasks $\Gamma$. Each task $T_i \in \Gamma$ refers to some computation which has to be performed periodically. The instances $T_{i,j}$ of task $T_i$ are released with period $P_i$. That is, the release time of instance $T_{i,j+1}$ is given by $r_{i,j+1} = r_{i,j} + P_i$, where $r_{i,j}$ is the release time of instance $T_{i,j}$. $C_i$ denotes the worst-case computation time of task $T_i$, which is the same for all of its instances. The finishing time of task instance $T_{i,j}$ is denoted by $f_{i,j}$. In our model, we assume real-time tasks with deadlines equal to periods. Hence, the deadline of a task instance $T_{i,j}$ is given by the release time of the next instance, $r_{i,j+1}$. Finally, the amount of reconfigurable logic resources a task requires is given by $A_i$. We normalize all resource requirements to the available resource offered by the FPGA. Assuming that no single task requires more resources than available, we get $A_i \le 1$ for all $T_i \in \Gamma$.

[Fig. 2. Example schedule for the task set of Table 1; the lower part plots the FPGA area occupied by the running tasks over time.]

The considered reconfigurable hardware device offers a certain amount of computational resources, e.g., the configurable logic blocks of an FPGA, which is also referred to as the area of the device. We normalize this area to 1. The device can execute any set $R \subseteq \Gamma$ of tasks simultaneously, as long as the amount of resources required by the task set does not exceed the available area, i.e., $\sum_{T_i \in R} A_i \le 1$.
A running instance of a task $T_i$ can be preempted by another task $T_j$ before its completion and, later on, be resumed. More generally, any set of running tasks $R$ can be preempted to execute a new set of tasks $R'$. Technically, the runtime system has to interrupt the execution of $R$ and save the contexts of all tasks $T_i \in R$. Then, the FPGA is fully reconfigured with a new configuration including all tasks $T_j \in R'$. When $R$ is scheduled for execution again, the previously saved contexts of $T_i \in R$ are restored and $R$ is resumed.
The time for the preemption and restore processes is neglected in our scheduling analysis. For current FPGA devices, these times are in the range of a few to a few tens of milliseconds. We currently assume task execution times at least one order of magnitude longer than that, and intend to model preemption overheads in future work.
As an example, Fig. 2 displays a possible schedule for the task set shown in Table 1. The upper part of Fig. 2 indicates the release times and deadlines for the tasks, as well as the running tasks. The lower part of Fig. 2 illustrates the tasks' areas and the sharing of the FPGA area over time.
Overall, four different FPGA configurations are needed for this schedule. The schedule shown can easily be proven feasible, because every task instance meets its deadline for the entire hyper-period of the task set (which amounts to 12 time units). The hyper-period is the least common multiple of all task periods in the task set. A feasible schedule defined over the hyper-period can be repeated an infinite number of times without any missed deadline.
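As a small illustration of the hyper-period, the following Python sketch computes the least common multiple of a list of integer periods. The function name `hyper_period` and the example periods are assumptions made for illustration; they are not the values of Table 1.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """Return the least common multiple of integer task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


# Hypothetical periods (not taken from Table 1).
example_periods = [4, 6, 12]
print(hyper_period(example_periods))
```

A schedule only has to be checked over this interval, since the release pattern of all tasks repeats afterwards.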
Formally, a schedule for the task set $\Gamma$ assigns a set of running tasks $R_k \subseteq \Gamma$ to every point in time $k$, such that $\sum_{T_i \in R_k} A_i \le 1$. No task instance may start execution before its release time. We call the schedule feasible if each task instance finishes its execution no later than its deadline, i.e., $\forall i, j: f_{i,j} \le r_{i,j+1}$.
Utilization Metrics
We define two utilization metrics to measure the computational load generated by a task set $\Gamma$. These metrics are central to the scheduling algorithm proposed in Section 4. Similar to the processor utilization factor defined in single processor real-time scheduling, we define the time-utilization factor of a task set $\Gamma$ to be $U_T(\Gamma) = \sum_{T_i \in \Gamma} \frac{C_i}{P_i}$. For the special case that all tasks are executed sequentially, $U_T$ is the fraction of time the FPGA spends executing tasks, whereas $1 - U_T$ is the fraction of idle time. While such a sequential schedule can mean an enormous waste of resources, it has two advantages. First, it allows us to rely on efficient schedulability tests known from single processor scheduling. Second, the number of required FPGA configurations is bounded by the number of tasks.
Improved scheduling techniques will try to better utilize the FPGA resources and execute several tasks in parallel. To describe the computational load for such a situation, we define as a more expressive metric the system-utilization factor of a task set $\Gamma$ as $U_S(\Gamma) = \sum_{T_i \in \Gamma} \frac{C_i}{P_i} A_i$. $U_S$ represents the fraction of the area-time product occupied by a task set. Visually, $U_S$ corresponds to the gray areas in the area-time diagram of a schedule. Obviously, we cannot find a feasible schedule for a task set with $U_S > 1$. Whether a feasible schedule exists for a task set with $U_S \le 1$ depends on the specific relations among the task properties, in particular the area requirements $A_i$. $U_T(\Gamma)$ and $U_S(\Gamma)$ are also defined for single tasks, as they are (minimal) instances of task sets. Table 1 shows the time and system utilization factors for the example tasks as well as for the complete task set.
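The two metrics can be computed directly from the task parameters. The following Python sketch is a minimal illustration under stated assumptions: the `Task` tuple, the function names, and the three example tasks are hypothetical and do not reproduce Table 1. Exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction
from typing import NamedTuple


class Task(NamedTuple):
    C: int        # worst-case computation time
    P: int        # period (equal to the relative deadline)
    A: Fraction   # area, normalized to the FPGA area


def time_utilization(tasks):
    """U_T: sum of C_i / P_i over all tasks."""
    return sum(Fraction(t.C, t.P) for t in tasks)


def system_utilization(tasks):
    """U_S: sum of (C_i / P_i) * A_i over all tasks."""
    return sum(Fraction(t.C, t.P) * t.A for t in tasks)


# Hypothetical task set (not the values of Table 1).
tasks = [Task(2, 4, Fraction(1, 2)),
         Task(3, 6, Fraction(1, 4)),
         Task(3, 12, Fraction(1, 2))]
print(time_utilization(tasks), system_utilization(tasks))
```

For these assumed tasks the time utilization is 5/4 and the system utilization is 1/2, so a purely sequential schedule is ruled out while the area-time demand alone does not exclude a parallel one.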
As we cannot expect to fully utilize the FPGA area, the resulting system utilization will generally stay below 1. In this paper, we use $U_S$ to experimentally rate the quality of a scheduling algorithm. We do not attempt to derive bounds for $U_S$ that could be used to decide schedulability for a given algorithm.
EDF-NF SCHEDULING
We adapt the simple EDF strategy, which has been successfully used in single- and multiprocessor environments, to our execution model and propose the scheduling algorithm EDF-Next Fit (EDF-NF). We use EDF-NF as an off-line scheduling procedure that precomputes a number of FPGA configurations which are dispatched at runtime. Similar to the original EDF algorithm, EDF-NF keeps a list of all released but not yet finished tasks in a ready queue. The ready queue is sorted by increasing absolute task deadlines. To determine the set $R$ of running tasks, EDF-NF scans through the ready queue. A task $T_i$ is added to the set of running tasks $R$ as long as the sum of the areas of all running tasks remains less than or equal to one. Whenever the next task cannot be added, EDF-NF proceeds in the ready queue and tries to add tasks with later absolute deadlines. At this point, EDF-NF diverges from the pure EDF rule. The motivation for adding tasks in next-fit manner is to improve the device utilization. If no more tasks can be added, the running set is closed and compiled to an FPGA configuration. Whenever a new task instance is released or running instances of tasks terminate, the FPGA configuration may change. To prove schedulability and generate the required FPGA configurations, EDF-NF simulates task executions and terminations for the complete hyper-period. Unfortunately, to our knowledge there is no efficient schedulability test. Further, the number of FPGA configurations can grow fairly large, which is a major disadvantage of this algorithm.
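The simulation over the hyper-period can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation, and it rests on several assumptions: computation times and periods are integers (so the running set can only change at integer times), all tasks are released synchronously at time 0, preemption overhead is zero, and ties between equal deadlines are broken by task index. The function name `edf_nf` and the example tasks are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def edf_nf(tasks):
    """Simulate EDF-NF for one hyper-period.

    tasks: list of (C, P, A) with integer C and P and area A in (0, 1].
    Returns (feasible, set of distinct running sets).
    """
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (p for _, p, _ in tasks))
    remaining = [0] * len(tasks)  # unfinished work of the current instance
    deadline = [0] * len(tasks)   # absolute deadline of the current instance
    configurations = set()
    for t in range(hyper):
        for i, (c, p, _) in enumerate(tasks):
            if t % p == 0:
                if remaining[i] > 0:  # previous instance missed its deadline
                    return False, configurations
                remaining[i], deadline[i] = c, t + p
        # Ready queue sorted by increasing absolute deadline.
        ready = sorted((i for i in range(len(tasks)) if remaining[i] > 0),
                       key=lambda i: (deadline[i], i))
        running, used = [], Fraction(0)
        for i in ready:  # skip tasks that do not fit, keep scanning
            if used + tasks[i][2] <= 1:
                running.append(i)
                used += tasks[i][2]
        if running:
            configurations.add(frozenset(running))
        for i in running:
            remaining[i] -= 1
    # All deadlines coincide with the end of the hyper-period.
    return all(r == 0 for r in remaining), configurations


# Hypothetical task set (not the values of Table 1).
example = [(2, 4, Fraction(1, 2)), (3, 6, Fraction(1, 4)), (3, 12, Fraction(1, 2))]
feasible, configs = edf_nf(example)
print(feasible, len(configs))
```

The cost of this test grows with the length of the hyper-period, which is why it does not scale to large task sets.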
SERVER-BASED SCHEDULING
In this section, we present a scheduling technique called Merge Server Distribute Load (MSDL). To construct a schedule, MSDL uses the concept of server tasks, or briefly servers. A server is a periodic task that reserves execution time and FPGA area for other tasks. We define a server as $S_i = (R_i, P_i, C_i, A_i)$, where $R_i = \{T_k, \ldots\} \subseteq \Gamma$ is a set of tasks for which execution time and area is reserved. $P_i$, $C_i$, $A_i$ denote the period, the computation time and the area of the server, respectively. The area of a server is set to equal the sum of the areas of the tasks represented by the server, $A_i = \sum_{T_k \in R_i} A_k$. Consequently, whenever the server $S_i$ is running, all tasks it represents are running.
The rationale of the MSDL algorithm is to construct a set of servers $\Omega$ from the original task set $\Gamma$, such that any feasible schedule for $\Omega$ implies a feasible schedule for $\Gamma$. More specifically, MSDL constructs a set of servers $\Omega$ by properly merging tasks together for parallel execution. The resulting servers are then scheduled for sequential execution on the FPGA with single processor EDF. Feasibility for the resulting set of servers is thus efficiently checked by the utilization test: $U_T(\Omega) \le 1$.
The Merge-server Distribute Load (MSDL) Algorithm
Algorithm 1 shows the pseudo code for the MSDL technique. First, each of the initial tasks is turned into a server. In every iteration, a valid pair of servers $S_x$, $S_y$ is then selected for merging (line 7).

Algorithm 1: MSDL
      ...
   7: S_x, S_y ← selectValidPairToMerge(Ω)
   8: if no pair found then
   9:     return Ω                                  (exit)
  10: S_z ← (R_x ∪ R_y, P_y, C_y, A_x + A_y)        (P_y ≤ P_x)
  11: C_x ← C_x − takeOverTime(S_x, S_z)
  12: Ω ← Ω ∪ S_z                                   (add server)
  13: Ω ← Ω \ S_y                                   (remove server)
  14: if C_x ≤ 0 then
  15:     Ω ← Ω \ S_x

If no valid server pair could be found, the algorithm exits and returns $\Omega$ as the final set of servers (line 9). Otherwise, the servers $S_x$ and $S_y$ are merged. Without loss of generality, we can assume that $S_y$ is the server with the shorter period. Then, a new server $S_z$ is created representing all tasks of the two original servers (line 10). The period and the computation time for $S_z$ are set to equal those of $S_y$. Therefore, $S_z$ is a full replacement of $S_y$, and $S_y$ can be removed from $\Omega$. The computation time of $S_x$ is reduced, since the new server $S_z$ reserves area and computation time for the tasks of $S_x$ as well. The actual reduction of computation time depends on how often the new server $S_z$ executes within the period of $S_x$. A pessimistic approximation for the reduction is given by:

$$\mathrm{takeOverTime}(S_x, S_z) = \left( \left\lfloor \frac{P_x}{P_z} \right\rfloor - 1 \right) C_z \qquad (1)$$
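The merge loop can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation. It assumes that a pair is valid for merging when the task sets are disjoint and the combined area does not exceed 1, that the reduction is the pessimistic amount $(\lfloor P_x / P_z \rfloor - 1) C_z$, and that the first valid pair in list order is chosen, which is a placeholder heuristic only. The names `Server`, `select_valid_pair`, `msdl`, and `is_schedulable`, as well as the example tasks, are hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction
from itertools import combinations


@dataclass(eq=False)  # identity comparison, so list.remove drops this exact server
class Server:
    tasks: frozenset  # indices of the represented tasks
    P: int            # period
    C: Fraction       # reserved computation time
    A: Fraction       # reserved area


def take_over_time(sx, sz):
    """Pessimistic reduction: (floor(Px / Pz) - 1) * Cz."""
    return (sx.P // sz.P - 1) * sz.C


def select_valid_pair(servers):
    """Placeholder heuristic: first pair with disjoint tasks that fits the area."""
    for a, b in combinations(servers, 2):
        if a.tasks.isdisjoint(b.tasks) and a.A + b.A <= 1:
            return (a, b) if a.P >= b.P else (b, a)  # (Sx, Sy) with Py <= Px
    return None


def msdl(task_params):
    """task_params: list of (C, P, A). Returns the final list of servers."""
    servers = [Server(frozenset([i]), p, Fraction(c), Fraction(a))
               for i, (c, p, a) in enumerate(task_params)]
    while True:
        pair = select_valid_pair(servers)
        if pair is None:
            return servers
        sx, sy = pair
        sz = Server(sx.tasks | sy.tasks, sy.P, sy.C, sx.A + sy.A)
        sx.C -= take_over_time(sx, sz)
        servers.append(sz)   # add merged server
        servers.remove(sy)   # Sz fully replaces Sy
        if sx.C <= 0:
            servers.remove(sx)


def is_schedulable(servers):
    """EDF utilization test on the sequentially scheduled servers."""
    return sum(s.C / s.P for s in servers) <= 1


# Hypothetical task set (not the values of Table 1).
result = msdl([(2, 4, Fraction(1, 2)), (3, 6, Fraction(1, 4)), (3, 12, Fraction(1, 2))])
print([sorted(s.tasks) for s in result], is_schedulable(result))
```

Each merge turns at least one mergeable pair into an overlapping pair, so the loop terminates; the quality of the result depends on the selection heuristic.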
As an example, we apply the MSDL algorithm to the example task set from Section 2. Then, in Section 4.2, we provide a more involved analysis to compute the exact computation time reduction. Table 2 shows the set of servers $\Omega^*_k$ generated in each iteration $k$ of the MSDL algorithm. Initially, the servers [...]
In the first iteration, $S_1$ and $S_2$ are selected and merged into $S_4$. $S_2$ receives the new computation time $C_2 \leftarrow C_2 - 2 = 3$. The server with the shorter period, $S_1$, is removed. In the second iteration, the residual $S_2$ and $S_3$ are merged into $S_5$. Not only the server with the shorter period is removed, but also $S_3$, since its computation time is reduced to zero. $\Omega^*_2$ is the final server set, since $R_4$ and $R_5$ are not disjoint and $A_4 + A_5 > 1$. As shown in Table 2, the time utilization factor is $U_T(\Omega^*_2) = 1$. Consequently, $\Omega^*_2$ can be feasibly scheduled by EDF. The resulting schedule is shown in Fig. 3. The figure also indicates the original tasks of $\Gamma^*$ executed inside the servers. Compared to the schedule given in Fig. 2, MSDL requires only two FPGA programming files. Table 2 also lists the system utilization factor $U_S$ of each server set, which increases over the iterations, since larger servers will reveal more idle areas and times inside their reservations. In essence, MSDL trades system utilization for time utilization to allow for an efficient schedulability test and to reduce the number of FPGA configurations.
Computation time reduction
In Equation 1, we made the pessimistic assumption that a server $S_z$ with $P_z \le P_x$ executes only $m = \lfloor P_x / P_z \rfloor - 1$ times between the release time and deadline of server $S_x$. Therefore, the computation time of $S_x$ was reduced by $m C_z$. A further reduction of $C_x$ is possible if we take into account that the server instances of $S_z$ which are not fully contained between the release time and deadline of server $S_x$ can still be useful. The precise amount of this reduction depends on the actual phase between $S_x$ and $S_z$. Fig. 4 illustrates the two cases that have to be distinguished: Case A shows an example with $P_x = 9$ and $P_z = 4$. Pessimistically, the server $S_z$ is guaranteed to execute $m = \lfloor 9/4 \rfloor - 1 = 1$ times between the release time and deadline of $S_x$. As Fig. 4-A illustrates, there is one instance of server $S_z$ being $e$ time units too early to be included in the considered period of $S_x$, and one instance of server $S_z$ being $l$ time units too late. In a worst-case schedule, the server $S_z$ executes at the beginning of its early instance and at the end of its late instance, resulting in some wasted computation time (denoted by black boxes) for $S_x$. However, some amount of the computation time of $S_z$ may still be useful to execute $S_x$, as denoted by the gray box of the first instance of $S_z$. Let $\delta = e + l$ denote the time of the early and late server instances which is outside the considered period of $S_x$. $\delta$ can be computed by $\delta = (m + 2) P_z - P_x$. Let $C_{el}$ denote the computation time of the early and late server instances which is guaranteed to be within the considered period of $S_x$. Then, $C_{el}$ can be computed by $C_{el} = \max(2 C_z - \delta, 0)$. Therefore, in case A the time $C_{el}$ is exactly the amount by which $C_x$ can be reduced in addition to Equation 1. Case B of Fig. 4 illustrates the second case, where the server $S_z$ is executed $m + 1$ times within the considered period of $S_x$.
In this case, $\delta$ changes to $\delta' = (m + 3) P_z - P_x$ and $C_{el}$ changes to $C'_{el} = \max(2 C_z - \delta', 0)$, respectively. It follows that in case B the time by which $C_x$ can be reduced in addition to Equation 1 is given by $C_z + C'_{el}$. Since we have to consider the worst case (out of case A and case B), the precise reduction of the computation time is determined by the smaller of the two amounts:

$$\mathrm{takeOverTime}(S_x, S_z) = m C_z + \min\left( C_{el},\; C_z + C'_{el} \right)$$

Fig. 4. Case analysis of computation time reduction (panels: case A and case B; rows: $S_x$ and $S_z$; time axis from 0 to 16).
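The case analysis can be turned into a small Python function. This is a sketch under the assumption that the worst case out of case A and case B is the smaller of the two additional reductions, added to the pessimistic amount $m C_z$; the function name `take_over_time_exact` and the value $C_z = 2$ used in the example call are hypothetical, while $P_x = 9$ and $P_z = 4$ are the values of case A in the text.

```python
def take_over_time_exact(Px, Pz, Cz):
    """Reduction of C_x when S_z (period Pz <= Px, time Cz) also serves S_x."""
    m = Px // Pz - 1                  # instances guaranteed to lie fully inside
    # Case A: m full instances plus one early and one late instance.
    delta_a = (m + 2) * Pz - Px
    c_el_a = max(2 * Cz - delta_a, 0)
    # Case B: m + 1 full instances plus one early and one late instance.
    delta_b = (m + 3) * Pz - Px
    c_el_b = max(2 * Cz - delta_b, 0)
    # Worst case: the smaller additional reduction of both cases.
    return m * Cz + min(c_el_a, Cz + c_el_b)


# Periods of the case A example; Cz = 2 is an assumed value.
print(take_over_time_exact(9, 4, 2))
```

With these inputs the refined reduction is 3 time units, compared to 2 time units for the pessimistic approximation.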
EXPERIMENTAL RESULTS
To evaluate the scheduling performance of the EDF-NF and MSDL algorithms, we have created synthetic task sets but adopted the task area requirements from realistic FPGA designs reported in the literature. To generate random task sets with varying values for the system utilization factor $U_S(\Gamma)$, we have proceeded as follows: We have chosen task areas uniformly distributed from 20%, which is approximately the size of a Discrete Wavelet Transform design on a Xilinx Virtex-II XC2V3000 FPGA [11], up to 40%, which is about the size of an MPEG-2 Video Decoder on the same FPGA [12]. The task computation times and periods were chosen such that the time utilization factors $U_T(T_i)$ are uniformly distributed in [0.2, 0.4]. To create a benchmark task set $\Gamma$, tasks have been created one by one according to the parameters above, and added to $\Gamma$ until a given limit on the task set's system utilization has been exceeded. These parameters result in task sets of approximately 10 tasks on average. The simulation result on a series of 1400 tests is shown in Fig. 5, labeled "n=small". The figure displays the percentage of feasibly scheduled task sets for MSDL and EDF-NF over the system utilization factor.
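The generation procedure can be sketched in Python as follows. The source does not state how periods were drawn, so the candidate period set used here is purely an assumption, as are the function name `generate_task_set` and its parameters; computation times are derived from the drawn time utilization and are therefore not integers.

```python
import random


def generate_task_set(us_limit, area_range=(0.2, 0.4), ut_range=(0.2, 0.4),
                      periods=(4, 6, 8, 12), rng=None):
    """Add random tasks (C, P, A) until the system utilization exceeds us_limit."""
    rng = rng or random.Random()
    tasks, us = [], 0.0
    while us <= us_limit:
        area = rng.uniform(*area_range)   # normalized task area
        ut = rng.uniform(*ut_range)       # time utilization C / P of the task
        period = rng.choice(periods)      # assumed period set, not from the source
        tasks.append((ut * period, period, area))
        us += ut * area
    return tasks


task_set = generate_task_set(0.8, rng=random.Random(1))
print(len(task_set))
```

Passing a seeded `random.Random` instance keeps a benchmark series reproducible.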
As expected, EDF-NF clearly outperforms MSDL in scheduling performance. EDF-NF is able to schedule about 50% of the task sets with a system utilization factor around 85% and accepts almost all task sets with $U_S$ less than 75%. In contrast, MSDL is able to schedule only a few task sets with a $U_S$ exceeding 70%, and achieves an acceptance rate of 50% for task sets with a $U_S$ around 55%.
On the other hand, Fig. 6 demonstrates the key advantage of MSDL by displaying the average number of FPGA configurations. For an MSDL schedule, the number of configurations equals the number of servers, which is bounded by $n$. For an EDF-NF schedule, the number of configurations equals the number of different sets of running tasks $R$. We further took into account that for EDF-NF FPGA configurations can be redundant, i.e., if a task set $R$ is a subset of another task set $R'$, only one configuration is needed. The resulting curve is labeled "without subsets". Fig. 6 shows that the number of FPGA configurations grows [...]

In order to generate task sets with more tasks, we have run a second test series, labeled "n=medium" in Fig. 5. Here, smaller tasks have been used by distributing the areas in [0.1, 0.2] (e.g., a 256-point complex FFT uses 10% of the XC2V3000 area [11]). The time utilization factors $U_T(T_i)$ have been uniformly distributed in [0.1, 0.2]. These settings result in task sets of approximately 40 tasks on average. MSDL performs slightly worse than on smaller task sets. For EDF-NF, however, we could not obtain results, as the EDF-NF schedulability test did not terminate in reasonable time.
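The "without subsets" count for EDF-NF can be illustrated with a short Python sketch: a running set that is a proper subset of another running set needs no configuration of its own. The function name `maximal_configurations` and the example running sets are assumptions made for illustration.

```python
def maximal_configurations(running_sets):
    """Keep only running sets that are not proper subsets of another one."""
    sets = {frozenset(s) for s in running_sets}
    return {s for s in sets if not any(s < other for other in sets)}


# Hypothetical running sets produced by an EDF-NF simulation.
observed = [{0, 1}, {1, 2}, {2}, {0, 2}, {0}, {1}]
print(len(maximal_configurations(observed)))
```

In this assumed example, six distinct running sets collapse to three configurations.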
CONCLUSION AND FUTURE WORK
We have discussed the problem of real-time scheduling of periodic tasks onto FPGA computers and have presented two scheduling algorithms, EDF-NF and MSDL. EDF-NF performs much better than MSDL in the sense that it can generate feasible schedules for task sets with higher system utilization. The experiments, however, emphasized the two key benefits of MSDL. First, MSDL comes with an efficient schedulability test. For larger real-time task sets that need a schedulability guarantee, EDF-NF is not an option. Second, the number of required FPGA configurations is bounded by the number of tasks, which makes this approach also feasible for larger task sets.
Future work will concentrate on the development and evaluation of different heuristics for selecting servers to be merged in the MSDL algorithm. One goal is to further reduce the number of required FPGA configurations and, thus, lower the memory requirements. Moreover, we will incorporate the modeling of reconfiguration and read-back in our schedules to increase the accuracy of the results.
