Abstract-The recent emergence of 3D partially reconfigurable FPGAs calls for efficient online hardware task scheduling and placement algorithms for such architectures. However, the algorithms available in the literature for 3D FPGAs suffer from a "blocking-effect": when choosing a location for each arriving hardware task at runtime, they tend to pick positions that cause currently scheduled tasks to block future hardware tasks from being scheduled in time to meet their deadlines. This problem must be solved to maximize the performance of partially reconfigurable runtime systems implemented using 3D chip technology. We propose a novel placement and scheduling algorithm with a blocking-aware heuristic that makes better decisions at runtime. In an evaluation using both synthetic and real workloads, our algorithm reduces the deadline miss ratio by up to 61% at the cost of 15% longer runtime overhead compared to state-of-the-art algorithms.
I. INTRODUCTION AND RELATED WORK
Recently, we have been witnessing the emergence of 3D FPGAs. Compared to 2D FPGAs, 3D FPGAs offer reductions in wire length [1], delay [1], [2], [3], channel width [2], power dissipation [2], [3], and energy consumption [1], as well as an increase in logic density [3]. However, online task scheduling and placement algorithms for 3D FPGAs have not been well explored in the literature. The only work we are aware of that targets 3D FPGAs is presented in [4], whereas other existing work targets only 2D FPGAs (e.g., [5], [6], [7]).
The algorithms presented in [4] perform compaction in the spatial and temporal domains. However, these algorithms lack what we call blocking awareness. That is, they schedule and place hardware tasks in such a way that the currently placed tasks may block future tasks from being scheduled. As a result, many tasks miss their deadlines. We call this issue the "blocking-effect". To solve it, we propose an efficient algorithm that is aware of this effect and avoids it.
The main contributions of this paper are:
• The first efficient blocking-aware online hardware task scheduling and placement algorithm targeting 3D partially reconfigurable FPGAs;
• An extensive evaluation using both synthetic and real hardware tasks implemented on a 3D FPGA.
The rest of the paper is organized as follows. In Section II, we introduce the problem of online task scheduling and placement targeting 3D partially reconfigurable FPGAs. Our proposed algorithm is presented in Section III. The algorithm is evaluated in Section IV. Finally, we conclude in Section V.
II. PROBLEM DEFINITION
A 3D FPGA, denoted by $FPGA(W, H, TH)$, contains $W \times H \times TH$ reconfigurable hardware units arranged in a 3D array, where each element of the array can be connected with other element(s) using the FPGA interconnection network. The reconfigurable unit located at the $i$-th position in the first coordinate ($x$), the $j$-th position in the second coordinate ($y$), and the $k$-th position in the third coordinate ($z$) is identified by the coordinate $(i, j, k)$, counted from the lower-leftmost coordinate $(1, 1, 1)$, where $1 \le i \le W$, $1 \le j \le H$, and $1 \le k \le TH$.
A hardware task is denoted by $T(a, w, h, th, lt, d)$. The task arrives at the system at time $a$ and requires a region of size $w \times h \times th$ in the 3D $FPGA(W, H, TH)$ during its lifetime $lt$; $d$ is its deadline. We define $lt = rt + et$, where $rt$ is the reconfiguration time and $et$ is the execution time of the task. $w$, $h$, and $th$ are the task width, height, and thickness, respectively, where $1 \le w \le W$, $1 \le h \le H$, and $1 \le th \le TH$.
Online task scheduling and placement algorithms targeting 3D FPGAs have to find a region of hardware resources inside the FPGA for running each arriving task. When no resources are available for allocating the hardware task at its arrival time $a$, the algorithm has to schedule the task for future execution. In this case, the algorithm needs to find the starting time $t_s$ and the free region (with lower-leftmost and upper-rightmost corners $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$, respectively) for executing the task in the future. The running or scheduled task is denoted as $T(x_1, y_1, z_1, x_2, y_2, z_2, t_s, t_f)$, where $t_f = t_s + lt$ is the finishing time of the task. The hardware task meets its deadline if $t_f \le d$. We call the lower-leftmost corner of a task the origin of that task.
The goals of any scheduling and placement algorithm are to minimize the total number of hardware tasks that miss their deadlines and to keep the runtime overhead low by minimizing the algorithm execution time. We define the deadline miss ratio as the ratio between the total number of hardware tasks that miss their deadlines and the total number of hardware tasks arriving at the system. The algorithm execution time is the time needed to schedule and place an arriving task.
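The two definitions above, the deadline test $t_f = t_s + lt \le d$ and the deadline miss ratio, can be illustrated with a minimal Python sketch. The `Task` record and the function names are hypothetical helpers introduced only for this illustration; they are not part of the simulation framework described later.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hardware task T(a, w, h, th, lt, d); field names follow the text."""
    a: int   # arrival time
    w: int   # width
    h: int   # height
    th: int  # thickness
    lt: int  # lifetime, lt = rt + et
    d: int   # deadline


def finish_time(task, t_s):
    """Finishing time t_f = t_s + lt for a task started at t_s."""
    return t_s + task.lt


def meets_deadline(task, t_s):
    """A task meets its deadline if t_f <= d."""
    return finish_time(task, t_s) <= task.d


def deadline_miss_ratio(tasks, start_times):
    """Missed tasks divided by all arriving tasks."""
    missed = sum(
        1 for task, t_s in zip(tasks, start_times)
        if not meets_deadline(task, t_s)
    )
    return missed / len(tasks)
```

For example, two identical tasks with $lt = 10$ and $d = 12$, started at times 0 and 5, finish at 10 and 15, so one of the two misses its deadline and the ratio is 0.5.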
III. PROPOSED ALGORITHM

A. Motivational Example
We first present a simple example to explain the idea behind our proposed blocking-aware algorithm. Let us assume that we have the task set shown in Table I. Blocking-unaware algorithms cannot avoid choosing placements for currently arriving tasks that later become obstacles preventing future tasks from being scheduled earlier. As a result, those future tasks miss their deadlines. In this simple example, task $T_3$ prevents task $T_4$ from being scheduled earlier. The penalty for this wrong placement decision for $T_3$ is that task $T_4$ misses its deadline, as shown in Figure 1(a).
To solve this problem, we introduce an algorithm that is aware of, and avoids, placements that will become obstacles for future tasks. In this example, this is done by placing task $T_3$ at a different location, as shown in Figure 1(b). Because of this better decision, the blocking-aware algorithm prevents task $T_3$ from hindering an earlier start of task $T_4$. Scheduled earlier, task $T_4$ can now finish execution in time to satisfy its deadline constraint.
To equip the algorithm with the necessary knowledge to avoid the "blocking-effect", the algorithm places an arriving task at a location that hides it inside the previously scheduled tasks as much as possible. To discuss this idea more concretely, let us introduce some definitions. A task $T_i$ is called a previous scheduled task (PST) of task $T_j$ if $T_j$ starts execution right after $T_i$ finishes; conversely, $T_j$ is then a next scheduled task (NST) of $T_i$. In this example, task $T_2$ is a previous scheduled task of task $T_3$. The idea behind our proposed algorithm is to choose a position for an arriving task that overlaps as much as possible with its PSTs and NSTs. We quantify this overlap as the hiding value later in our discussion. In this example, the placement of task $T_3$ in Figure 1(b) has a higher hiding value than its placement in Figure 1(a).

[Figure 1: task set of Table I scheduled and placed by the blocking-unaware algorithm (a) versus the blocking-aware algorithm (b).]
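The acceptable region of an arriving task, i.e., the set of positions at which its origin can be placed so that the task stays inside the FPGA, has its lower-leftmost and upper-rightmost corners at $(1, 1, 1)$ and $(W - w + 1, H - h + 1, TH - th + 1)$, respectively. This information is also needed by the algorithm to limit its search region in finding the best position for each arriving task so as to lower the runtime overhead.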
C. Conflicting Region with Scheduled Tasks
To prevent our algorithm from placing tasks that conflict with other tasks, we make it aware of the conflicting region. The conflicting region of an arriving hardware task $AT(a, w, h, th, lt, d)$ with respect to a scheduled task $ST(x_1, y_1, z_1, x_2, y_2, z_2, t_s, t_f)$ is the region where the algorithm cannot place the origin of the arriving task $AT$ without conflicting with the corresponding scheduled task $ST$. The conflicting region is defined as the region with its lower-leftmost and upper-rightmost corners at $(\max(1, x_1 - w + 1), \max(1, y_1 - h + 1), \max(1, z_1 - th + 1))$ and $(\min(x_2, W - w + 1), \min(y_2, H - h + 1), \min(z_2, TH - th + 1))$, respectively; these are the same bounds used by the loops of Algorithm 1. The algorithm exploits this information to lower the runtime overhead further.
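As an illustration, the search bounds can be written as two small Python functions. This is a sketch under stated assumptions: the conflicting-region bounds are taken from the loop limits of Algorithm 1, the acceptable-region bounds follow from requiring the task to fit inside the FPGA, and the function names and tuple layout are hypothetical.

```python
def acceptable_region(W, H, TH, w, h, th):
    """Corners of the region where the origin of a w x h x th task may lie
    so that the whole task stays inside FPGA(W, H, TH) (1-based, inclusive)."""
    return (1, 1, 1), (W - w + 1, H - h + 1, TH - th + 1)


def conflicting_region(W, H, TH, w, h, th, st_box):
    """Corners of the region of origins at which the arriving task would
    overlap the scheduled task occupying st_box = (x1, y1, z1, x2, y2, z2)."""
    x1, y1, z1, x2, y2, z2 = st_box
    lower = (max(1, x1 - w + 1), max(1, y1 - h + 1), max(1, z1 - th + 1))
    upper = (min(x2, W - w + 1), min(y2, H - h + 1), min(z2, TH - th + 1))
    return lower, upper
```

For a $50 \times 50 \times 50$ FPGA and a $10 \times 10 \times 10$ arriving task, a scheduled task occupying $(20, 20, 20)$ to $(29, 29, 29)$ yields the conflicting region $(11, 11, 11)$ to $(29, 29, 29)$: only origins in that box need their starting times updated, rather than all $41^3$ origins of the acceptable region.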
D. Compaction Value
We attempt to reserve as much free region as possible in the middle of the FPGA to better accommodate future tasks. Hence, the algorithm spreads hardware tasks close to the FPGA boundary. To quantify this choice, we introduce the compaction value with the FPGA boundary ($CV_{FPGA}$), as illustrated in Figure 2. In this simple example, the arriving task $AT$ at position $(1, 1, 1)$ has three surfaces in common with the FPGA boundary, i.e., the left, bottom, and front surfaces. Therefore, the compaction value with the FPGA boundary is the sum of these common surfaces multiplied by the lifetime, which can be formulated as $CV_{FPGA} = (w \times h + w \times th + h \times th) \times lt$. As there are many positions where the arriving task $AT$ can be placed in the FPGA, we provide a general formula in Table II.

To place each arriving task close to other scheduled tasks both in the three spatial coordinates (compaction in 3D space) and in the time coordinate (compaction in time), the proposed algorithm also computes, in addition to the compaction value with the FPGA boundary, a quantity called the compaction value with scheduled tasks ($CV_{ST}$). A simple example of how to compute this value is shown in Figure 3. In this example, the overlapped area between the bottom side of the arriving task $AT$ and the top side of the scheduled task $ST$ determines the compaction value with respect to that scheduled task. As a result, $CV_{ST}$ in this case is the product of this overlapped area and the temporal term $\min(lt, t_f - t_s)$.
As our algorithm needs to compact tasks not only in the three-dimensional domain but also in the time domain, the term $\min(lt, t_f - t_s)$ is added in this computation. $AT$ can be placed in any free region of the FPGA, so there are a number of ways in which $AT$ can overlap with $ST(x_1, y_1, z_1, x_2, y_2, z_2, t_s, t_f)$, and we need a general formula to compute those values, as presented in Table III. The sum of the compaction value with the FPGA boundary and the compaction value with scheduled tasks is called the total compaction value and is formulated as

$$CV = CV_{FPGA} + CV_{ST} \qquad (1)$$

This value guides our algorithm to place tasks as compactly as possible in four dimensions ($x$, $y$, $z$, and time coordinates).
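To make the compaction value more concrete, the following Python sketch generalizes the two worked examples. It is an assumption, not a transcription of Tables II and III: $CV_{FPGA}$ is taken as the area of every task face lying on the FPGA boundary times $lt$, and $CV_{ST}$ for one scheduled task as the face-to-face contact area times $\min(lt, t_f - t_s)$. All function names are hypothetical.

```python
def cv_fpga(W, H, TH, x, y, z, w, h, th, lt):
    """Area of the task faces lying on the FPGA boundary, times lifetime."""
    area = 0
    if x == 1:
        area += h * th          # left face
    if x + w - 1 == W:
        area += h * th          # right face
    if y == 1:
        area += w * th          # bottom face
    if y + h - 1 == H:
        area += w * th          # top face
    if z == 1:
        area += w * h           # front face
    if z + th - 1 == TH:
        area += w * h           # back face
    return area * lt


def _overlap(a1, a2, b1, b2):
    """Length of the overlap of inclusive integer ranges [a1,a2], [b1,b2]."""
    return max(0, min(a2, b2) - max(a1, b1) + 1)


def cv_st(at_box, lt, st_box, st_ts, st_tf):
    """Face-to-face contact area between AT and one ST, times the time term."""
    ax1, ay1, az1, ax2, ay2, az2 = at_box
    bx1, by1, bz1, bx2, by2, bz2 = st_box
    ox = _overlap(ax1, ax2, bx1, bx2)
    oy = _overlap(ay1, ay2, by1, by2)
    oz = _overlap(az1, az2, bz1, bz2)
    area = 0
    if ax1 == bx2 + 1 or bx1 == ax2 + 1:   # touching along x
        area += oy * oz
    if ay1 == by2 + 1 or by1 == ay2 + 1:   # touching along y
        area += ox * oz
    if az1 == bz2 + 1 or bz1 == az2 + 1:   # touching along z
        area += ox * oy
    return area * min(lt, st_tf - st_ts)


def total_cv(fpga_term, st_terms):
    """Equation (1): CV = CV_FPGA + CV_ST, summing over scheduled tasks."""
    return fpga_term + sum(st_terms)
```

At position $(1, 1, 1)$ the first function reduces to $(w \times h + w \times th + h \times th) \times lt$, matching the example of Figure 2.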
E. Hiding Value
In addition to the above compaction, the proposed algorithm is also equipped with the ability to maximize the hiding value mentioned previously. A simple example of how to compute the hiding value ($HV$) is illustrated in Figure 4. In this figure, the volume of the region with the upper-leftmost corner $(x_1, y_1, z_2)$ and the lower-rightmost corner $(x + w - 1, y + h - 1, z)$ is the hiding value that we are looking for. Therefore, the $HV$ for this position can be formulated as $HV = (x + w - x_1) \times (y + h - y_1) \times (z_2 - z + 1)$. We introduce the general formula for computing HV for any possible condition as HV = max(min(x + w − 1, 
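One way to generalize the Figure 4 case, assumed here rather than taken from equation (2), is the clamped overlap volume between the region of the arriving task placed at origin $(x, y, z)$ and the region $(x_1, y_1, z_1)$ to $(x_2, y_2, z_2)$ of a PST or NST. The function name below is hypothetical.

```python
def hiding_value(x, y, z, w, h, th, st_box):
    """Overlap volume between the arriving task at origin (x, y, z) and the
    box st_box = (x1, y1, z1, x2, y2, z2); zero if the boxes are disjoint."""
    x1, y1, z1, x2, y2, z2 = st_box
    dx = max(min(x + w - 1, x2) - max(x, x1) + 1, 0)
    dy = max(min(y + h - 1, y2) - max(y, y1) + 1, 0)
    dz = max(min(z + th - 1, z2) - max(z, z1) + 1, 0)
    return dx * dy * dz
```

When $x \le x_1 \le x + w - 1 \le x_2$, $y \le y_1 \le y + h - 1 \le y_2$, and $z_1 \le z \le z_2 \le z + th - 1$, this expression reduces to $(x + w - x_1) \times (y + h - y_1) \times (z_2 - z + 1)$, which is the value given for Figure 4.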
F. Pseudocode and Analysis
The pseudocode of our blocking-aware algorithm, called 4D Compaction (4DC), is shown in Algorithm 1. Two linked lists, the execution list (EL) and the reservation list (RL), are maintained. The EL records the information of all currently running hardware tasks sorted in order of increasing finishing times, whereas the RL contains the information of all scheduled tasks sorted in order of increasing starting times. The information stored in the lists is the lower-leftmost corner coordinate $(x_1, y_1, z_1)$, the upper-rightmost corner coordinate $(x_2, y_2, z_2)$, the starting time $t_s$, the finishing time $t_f$, the task name, the next pointer, and the previous pointer.
In lines 1-29, the algorithm computes the 3D starting time matrix $STM(x, y, z)$ for the arriving task volume $w \times h \times th$ inside the FPGA volume $W \times H \times TH$. This matrix records the earliest starting time for each potential position of the arriving task. The algorithm collects all possible positions that have enough space for the arriving task by scanning the EL and RL. It first fills each element of the STM with the arrival time $a$ of the incoming task (lines 1-7). To minimize runtime overhead, the algorithm only needs to create the STM for the acceptable region presented previously. The algorithm then updates the groups of elements that are affected by all executing tasks in the EL (lines 8-18) and by all scheduled tasks in the RL (lines 19-29). To reduce runtime overhead further, it only needs to update the affected elements inside the conflicting region defined before.
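The initialization and the EL update can be sketched in Python as follows. This is a minimal illustration, not the C implementation used in the evaluation: a dictionary stands in for the 3D matrix, only the update rule visible in lines 9-13 of Algorithm 1 is modeled, and the handling of reserved tasks in the RL (lines 19-29) is omitted. The function names and the tuple layout of `executing` are hypothetical.

```python
def build_stm(W, H, TH, a, w, h, th, executing):
    """Earliest start time for every origin in the acceptable region.

    executing: iterable of (x1, y1, z1, x2, y2, z2, t_f) for tasks in the EL.
    """
    # Lines 1-7: initialise every acceptable origin with the arrival time a.
    stm = {
        (x, y, z): a
        for x in range(1, W - w + 2)
        for y in range(1, H - h + 2)
        for z in range(1, TH - th + 2)
    }
    # Lines 8-18: origins inside a conflicting region cannot start before
    # the corresponding executing task finishes.
    for (x1, y1, z1, x2, y2, z2, t_f) in executing:
        for x in range(max(1, x1 - w + 1), min(x2, W - w + 1) + 1):
            for y in range(max(1, y1 - h + 1), min(y2, H - h + 1) + 1):
                for z in range(max(1, z1 - th + 1), min(z2, TH - th + 1) + 1):
                    if stm[(x, y, z)] < t_f:
                        stm[(x, y, z)] = t_f
    return stm


def earliest_candidates(stm):
    """Best starting time and all origins that achieve it."""
    best = min(stm.values())
    return best, sorted(pos for pos, t in stm.items() if t == best)
```

On a $4 \times 4 \times 4$ FPGA with a $2 \times 2 \times 2$ arriving task and one executing task occupying $(1, 1, 1)$ to $(2, 2, 2)$ until $t_f = 7$, the acceptable region holds 27 origins, 8 of which fall in the conflicting region and are pushed to time 7; the remaining 19 keep the arrival time and form the candidate set.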
In line 30, the algorithm collects from the STM all candidate positions that have the earliest starting time (the best positions in terms of starting time). Since the algorithm is designed to choose the best position for each arriving task not only in terms of starting time (time domain) but also in terms of space (space domain), it needs to pack tasks more compactly still. The algorithm computes the compaction value $CV$ (line 32) using the formulas in Table II, Table III, and equation (1), and chooses the best position among all the best starting time positions. To reduce runtime overhead further, the algorithm does not compute the compaction value for all positions; it computes it only for the candidates (line 31). Intuitively, the highest compaction value gives the best position in terms of packing tasks in four dimensions, as shown previously.
Besides the compaction value, the algorithm also uses the sum of finish time difference (SFTD) heuristic over all scheduled tasks that are in contact with the arriving task in three-dimensional space (referred to as the VC set). The algorithm computes the current SFTD ($c_{SFTD}$) in line 33. The SFTD heuristic gives our algorithm the ability to group tasks with similar finishing times so as to obtain a large free volume during de-allocations. Moreover, to avoid the "blocking-effect", it computes the hiding value in line 34 using equation (2).
The algorithm chooses the position with the highest compaction value, the lowest SFTD value, and the highest hiding value for allocating the arriving task (lines 35-57). Allocating the arriving tasks at the highest compaction value compacts the tasks both in time and space; while grouping tasks with similar finish times creates more possibilities to produce larger free space during de-allocations. On top of that, the highest hiding value guides our algorithm to avoid "blocking-effect".
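The three criteria can be combined in more than one way. The Python sketch below assumes a lexicographic order (highest compaction value first, then lowest SFTD, then highest hiding value) and assumes that SFTD is the sum of absolute finishing-time differences over the VC set. Both are assumptions made for illustration, and the function names are hypothetical.

```python
def sftd(at_finish, contact_finish_times):
    """Assumed SFTD: sum of |t_f(AT) - t_f(ST)| over tasks in the VC set."""
    return sum(abs(at_finish - t_f) for t_f in contact_finish_times)


def choose_position(scored_candidates):
    """Pick a position from (position, cv, sftd, hv) tuples.

    Assumed priority: highest CV, then lowest SFTD, then highest HV.
    """
    best = max(scored_candidates, key=lambda c: (c[1], -c[2], c[3]))
    return best[0]
```

With candidates `((1, 1, 1), 100, 5, 10)`, `((2, 1, 1), 100, 3, 2)`, and `((3, 1, 1), 90, 0, 50)`, the first two tie on compaction value and the second wins on its lower SFTD, so position `(2, 1, 1)` is selected under this ordering.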
The algorithm allocates the arriving task when space is available for the task; otherwise, the algorithm needs to schedule the task for future execution. If the arriving task can be allocated at its arrival time (line 59), it is executed immediately and added to the EL (line 60); otherwise, it is inserted into the RL (line 62).
When the tasks in the RL start execution, they are removed from the RL and added to the EL. The finished tasks in the EL are deleted after execution. These updates are performed whenever the lists are not empty (lines 64-69).
The time complexity analysis of our proposed algorithm is presented in Table IV, where $W$, $H$, $TH$, $N_{ET}$, and $N_{RT}$ are the FPGA width, the FPGA height, the FPGA thickness, the number of executing tasks in the EL, and the number of reserved tasks in the RL, respectively.

Algorithm 1: 4D Compaction (4DC), lines 9-13 and 20-22
    ...
 9: for (x = max(1, x1-w+1); x ≤ min(x2, W-w+1); x++) do
10:   for (y = max(1, y1-h+1); y ≤ min(y2, H-h+1); y++) do
11:     for (z = max(1, z1-th+1); z ≤ min(z2, TH-th+1); z++) do
12:       if STM(x,y,z) < t_f then
13:         STM(x,y,z) = t_f
    ...
20: for (x = max(1, x1-w+1); x ≤ min(x2, W-w+1); x++) do
21:   for (y = max(1, y1-h+1); y ≤ min(y2, H-h+1); y++) do
22:     for (z = max(1, z1-th+1); z ≤ min(z2, TH-th+1); z++) do
    ...
IV. EVALUATION
A. Evaluation with synthetic hardware tasks
We have built a discrete-time simulation framework in C to evaluate our algorithm. The framework was compiled and run under the Windows XP operating system on a PC with an Intel(R) Core(TM)2 Quad CPU Q9550 at 2.83 GHz and 3 GB of RAM. We generated 500 random tasks for each task set. Every hardware task has an arrival time, a size (width, height, thickness), a lifetime, and a deadline. The task widths, heights, and thicknesses are randomly generated in the range [5..30] reconfigurable units. The lifetimes are randomly generated in [5..100] time units, while the inter-task arrival periods are randomly chosen between 1 and 50 time units. The number of tasks per arrival is randomly generated in [1..5]. Since the algorithms are online, the information about arriving tasks is unknown until their arrival times; the algorithm is not allowed to access this information at compile time. We model a 3D FPGA with 50x50x50 reconfigurable units.
Our algorithm is designed for 3D FPGAs. For a fair comparison, we compare it only with existing algorithms that target 3D FPGAs. Based on our literature survey, two heuristics used in [4] focus on 3D FPGAs. These heuristics are called 3D adjacency and 4D adjacency.
To evaluate the proposed algorithm, we have implemented three different algorithms:
• 3D adjacency heuristic (3D Adj) algorithm [4];
• 4D adjacency heuristic (4D Adj) algorithm [4];
• our algorithm using the blocking-awareness heuristic, called the 4D Compaction (4DC) algorithm.
The evaluation is based on the two performance parameters defined in Section II: the deadline miss ratio and the algorithm execution time. The experimental results with synthetic hardware tasks are presented in Figure 5. All results are presented as the average over 10 runs of experiments for each task set. The relative deadline of a hardware task in this figure is defined as $rd = d - lt - a$. It is also randomly generated, with the first and second numbers shown in the figure as the minimum and maximum values, respectively. A shorter relative deadline makes it more difficult for a scheduling and placement algorithm to meet task deadlines. As a result, the deadline miss ratio increases as the relative deadline decreases.
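The synthetic workload described in this subsection can be reproduced approximately with the following Python sketch. The value ranges come from the text; the relative-deadline range `rd_range`, the seed, and the function name are hypothetical, since the relative-deadline ranges actually used appear only in Figure 5. Deadlines are derived from $rd = d - lt - a$.

```python
import random


def generate_task_set(n_tasks=500, rd_range=(0, 50), seed=0):
    """Return a list of (a, w, h, th, lt, d) tuples in arrival order."""
    rng = random.Random(seed)
    tasks = []
    now = 0
    while len(tasks) < n_tasks:
        now += rng.randint(1, 50)            # inter-arrival period
        for _ in range(rng.randint(1, 5)):   # tasks per arrival
            if len(tasks) == n_tasks:
                break
            w = rng.randint(5, 30)
            h = rng.randint(5, 30)
            th = rng.randint(5, 30)
            lt = rng.randint(5, 100)         # lifetime
            rd = rng.randint(*rd_range)      # relative deadline (assumed range)
            d = now + lt + rd                # from rd = d - lt - a
            tasks.append((now, w, h, th, lt, d))
    return tasks
```

A fixed seed makes a task set repeatable, which is convenient when the same workload has to be fed to several algorithms for comparison.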
The 3D Adj has the highest deadline miss ratio due to its spatial-only compaction. Since the 4D Adj performs additional compaction in the time domain on top of its 3D spatial compaction, it has better scheduling and placement quality than the algorithm using the 3D adjacency heuristic, i.e., it has up to 21% lower deadline miss ratio than the 3D Adj. The figure also shows that our algorithm using the blocking-awareness heuristic has the lowest deadline miss ratio, i.e., up to 26% and 11% lower deadline miss ratio compared to the 3D Adj and 4D Adj heuristics, respectively. This is because our proposed algorithm not only performs four-dimensional packing of hardware tasks but also avoids the "blocking-effect" as presented previously. By avoiding this effect, tasks can be scheduled earlier to meet their deadlines.
As the 3D Adj does not consider compaction in the time domain, it has the lowest runtime overhead. The 4D Adj is aware of the time domain and incurs 14% higher runtime overhead to choose the best position for each arriving task. Because the algorithm using the blocking-aware heuristic additionally needs to avoid the "blocking-effect", it requires 12% longer time than the 4D Adj to choose the best solution for arriving tasks.
B. Evaluation with real hardware tasks
To complete the evaluation, the algorithm is also evaluated using real 3D hardware tasks. The 3D FPGA implementations of the MCNC [8] benchmark circuits obtained from [9] are used for this purpose. The results of the experiments are shown in Figure 6. The figure shows that the superiority of our algorithm holds not only for synthetic tasks but also for real tasks. The evaluation with real tasks shows that our algorithm has, on average, 61% and 23% lower deadline miss ratio compared to the algorithms using the 3D and 4D adjacency heuristics, respectively. The proposed algorithm needs 15% longer runtime overhead to incorporate blocking awareness.
V. CONCLUSION
In this paper, we have introduced the "blocking-effect" in online scheduling and placement of tasks on 3D partially reconfigurable FPGAs. To solve this issue, we have proposed a blocking-aware heuristic and used it to build a novel placement and scheduling algorithm named 4D Compaction. The proposed algorithm places or schedules arriving tasks compactly on a 3D partially reconfigurable FPGA in four dimensions, i.e., in both the 3D spatial coordinates and the time coordinate. Moreover, the algorithm groups tasks with similar finishing times to form a larger free volume for better allocation of future tasks. Finally, the algorithm avoids the "blocking-effect". The algorithm is evaluated using both synthetic and real workloads. Based on this evaluation, the proposed algorithm reduces the deadline miss ratio by up to 61% with 15% longer runtime overhead compared to the state-of-the-art schemes.
