Abstract
Introduction
In recent years, with the development of partially reconfigurable FPGAs, hardware tasks can be loaded into (or removed from) the FPGA individually without interfering with other tasks running on the same FPGA. In many cases, such systems have runtime constraints and the sequence of hardware tasks is unknown in advance. In all on-line task placement algorithms designed for such systems, determining and maintaining the free space on the FPGA is the most time-consuming process. This creates a strong demand for efficient algorithms to manage the free FPGA space.
Because most hardware tasks fit in a rectangular shape, the free FPGA space is usually recorded as a set of rectangles. Two kinds of rectangle sets are used: non-overlapping rectangles and maximum rectangles. In general, it is more time-consuming to maintain a set of maximum rectangles than a set of non-overlapping rectangles. Maintaining a set of maximum rectangles, however, increases the likelihood of fitting an arriving task on the FPGA [1].
In this paper, we propose a novel algorithm to find the complete set of maximum free rectangles at runtime. The main contributions of this paper are:
• a new mechanism to find a complete set of maximum free rectangles on the FPGA;
• improved algorithm performance compared to other state-of-the-art approaches.
In section 2, related work is presented. We then detail our algorithm in section 3. In section 4, we present the simulation results and evaluate the performance of our algorithm and of two previously proposed algorithms. Finally, we conclude the paper and discuss future directions in section 5.
Related Work
In 1999, Bazargan et al. [1] proposed their on-line task placement approach. This approach stores the free space of the FPGA as a set of non-overlapping rectangles and achieves high speed, but at the cost of low placement quality. In [3], Handa et al. proposed an algorithm to find empty space on the FPGA. In their algorithm, the FPGA surface is modeled as a 2D array of configurable units, referred to as the "area matrix". Their algorithm starts by encoding the matrix. Thereafter, all maximum staircases are found based on the encoded information. Finally, the maximum free rectangles are extracted from each maximum staircase. In [2], Cui et al. used the same 2D FPGA surface model but with different encoding information. The authors defined MKE points, which are used in the scanning process while looking for the maximum free rectangles. In [4], Tomono et al. proposed an on-line placement approach that takes the connectivity of a module to the remainder of the system into account. In their approach, the staircase algorithm [3] is reused to find the complete set of maximum free rectangles.
In all algorithms above, two basic approaches to managing rectangular free space can be identified: the tracing approach and the scanning approach. In the tracing approach, only non-overlapping rectangles can be created and used. Because no two rectangles overlap, all geometry operations are limited to the current rectangle. Algorithms using this approach, e.g. [1], achieve a shorter execution time but a lower overall placement quality (measured by the task rejection rate). Approaches based on non-overlapping rectangles can fail to fit a newly arrived task although enough space is available, as shown in the simple example depicted in figure 1(a). With the scanning approach, e.g. [3, 2], whose output is the complete set of maximum free rectangles, the arriving task will be placed as shown in figure 1(b). In the remainder of this paper, only the scanning approach is considered because of its higher placement quality. We propose a novel algorithm using the scanning approach to find the complete set of maximum free rectangles. The details of the algorithm are presented in the next section.
Figure 1. Allocation of an arrival task
Flow scan algorithm
The algorithm proposed in this paper is called Flow Scan (FS). The FS algorithm is characterized by fast FPGA free space management.
Definitions
In-edge and out-edge: For each placed task, its lower Y coordinate is defined as its in-edge and its higher Y coordinate as its out-edge. By default, the bottom of the FPGA area is defined as an out-edge and the top as an in-edge. The scanning flow direction is from in-edge to out-edge, as shown in figure 3(d).
Rectangular well (RW): During the scanning process, some temporary rectangles without top lines are created. We define such rectangles as rectangular wells.
Formed rectangular well (FRW): Any RW that can only be expanded upwards is defined as an FRW, as shown in figure 3(d). If several RWs with the same X coordinates are created during the scanning process, only the FRW is recorded; e.g. in figure 3(d), only FRW6 is recorded and the temporary RWtemp is removed.
Maximum free rectangle: A rectangle whose top, bottom, left and right edges cannot be expanded. It is written as (left, right, bottom, top) in this paper, e.g. (0, 100, 0, 20) for the maximum free rectangle available at the bottom of figure 3(d).
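The definition of a maximum free rectangle can be checked mechanically on small instances. The following Python sketch is an illustration only and is not part of the FS algorithm. It assumes integer coordinates, tasks written in the same (left, right, bottom, top) form as rectangles, and task positions (10, 25, 20, 50) and (50, 70, 60, 85) that are inferred from the rectangles listed for the example of figure 3, not stated explicitly in the text. The function names are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles a and b (left, right, bottom, top) share interior area."""
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]


def is_free(rect, tasks, width, height):
    """True if rect lies inside the FPGA area and overlaps no placed task."""
    left, right, bottom, top = rect
    inside = 0 <= left < right <= width and 0 <= bottom < top <= height
    return inside and not any(overlaps(rect, task) for task in tasks)


def is_maximum_free(rect, tasks, width, height):
    """True if rect is free and none of its four edges can be moved outwards.

    With integer coordinates, trying a one-unit expansion on each side is
    enough to decide whether an edge can be expanded at all.
    """
    left, right, bottom, top = rect
    grown = [
        (left - 1, right, bottom, top),
        (left, right + 1, bottom, top),
        (left, right, bottom - 1, top),
        (left, right, bottom, top + 1),
    ]
    return is_free(rect, tasks, width, height) and not any(
        is_free(g, tasks, width, height) for g in grown
    )


# Task positions inferred from the figure 3 example (an assumption).
TASKS = [(10, 25, 20, 50), (50, 70, 60, 85)]
```

For instance, is_maximum_free((0, 100, 0, 20), TASKS, 100, 100) holds, whereas (0, 10, 0, 50) is free but not maximum because its top edge can still move upwards.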
Data structure
In our algorithm we use linked lists to store the required information. We define four linked lists: the general edge linked list (GELL), the in-edge and out-edge linked lists (IELL and OELL), and the rectangular well linked list (RWLL). A GELL node holds a height at which one or more edges are present, together with two edge counters that store the number of in-edges and out-edges at that height. A node of the IELL or OELL holds the height and the X coordinates of the edge, the expire time of the corresponding task, and a pointer to the GELL node that represents the same height. This pointer is used to update the corresponding edge counter when a new edge is inserted or an existing one is removed. The RWLL stores all current FRWs. An RW node in the RWLL stores the lower Y coordinate and both X coordinates of the FRW. In figure 2, the linked lists representing the situation as depicted in figure 
Flow scan processing
There are two basic scan procedures in the FS algorithm: the in-edge processing and the out-edge processing. The in-edge processing happens when the scanning flow reaches an in-edge, and the out-edge processing is called when leaving an out-edge. In the in-edge processing, if an FRW overlaps an in-edge in the X direction, a maximum free rectangle is created by adding to the FRW a top line at the height of the in-edge. The search for overlapped FRWs starts only when the scanning process reaches an in-edge; for each overlapped FRW found, a maximum free rectangle is created. In the cases FRW_L < in-edge_L < FRW_R or FRW_L < in-edge_R < FRW_R (FRW_L denotes the left side of the FRW and FRW_R its right side; the same notation applies to the in-edge), at most two new RWs can be created for the non-overlapping area within the FRW. If the width span (the length along the X axis) of an FRW is fully covered by an in-edge, no new FRW is generated. In the out-edge processing, only one new FRW is created. Its bottom has the same height as the out-edge.
The simple example shown in figure 3 clarifies the process. In the beginning, an initial FRW is created at the bottom of the 2D FPGA area. The bottom of this FRW is 0 and it covers the whole width of the FPGA area, as shown in figure 3(a).
At height = 20 (shorthand for the scan process reaching height 20 in the Y direction): the in-edge of task 1 is reached. The initial FRW overlaps this edge in the X direction, so it becomes the maximum free rectangle (0, 100, 0, 20). Thereafter, two new RWs are created for the non-overlapping area as explained above. Because both of them can only be expanded upwards, they are FRWs, shown as FRW1 and FRW2 in figure 3(b). This step is completed by recording the two FRWs in the RWLL and outputting the maximum free rectangle found: (0, 100, 0, 20).
At height = 50: the out-edge of task 1 is met, so the out-edge processing is performed, which creates a new FRW: FRW3, shown in figure 3(c).
At height = 60: the in-edge processing is initiated. Because FRW2 and FRW3 overlap the in-edge of task 2, two maximum free rectangles, (25, 100, 0, 60) and (0, 100, 50, 60), are found and generated. FRW4, FRW5 and FRW6 are created and recorded for the non-overlapping areas.
At height = 85 and height = 100: FRW7 is created at the out-edge of task 2. When the top edge (100) is reached, all existing FRWs are turned into maximum free rectangles with top at Y = 100.
During the scanning process described above, a total of eight maximum free rectangles were found: (0, 100, 0, 20), (0, 100, 50, 60), (0, 10, 0, 100), (0, 50, 50, 100), (0, 100, 85, 100), (25, 100, 0, 60), (25, 50, 0, 100) and (70, 100, 0, 100).
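The scan described above can be replayed with a short Python sketch. It is a simplified illustration under several assumptions, not the C implementation evaluated in this paper: a dictionary stands in for the RWLL, the task list is scanned directly in place of the GELL, IELL and OELL, in-edges are processed before out-edges when both occur at the same height, and the free span at an out-edge is taken as the maximal free interval of the row just above it. The task coordinates are inferred from the rectangles listed in the example, and all names are hypothetical.

```python
from collections import namedtuple

# A placed task occupies [left, right) x [bottom, top).
Task = namedtuple("Task", "left right bottom top")


def free_spans(tasks, y, width):
    """Maximal free X intervals in the row just above height y."""
    blocked = sorted((t.left, t.right) for t in tasks if t.bottom <= y < t.top)
    spans, cursor = [], 0
    for left, right in blocked:
        if left > cursor:
            spans.append((cursor, left))
        cursor = max(cursor, right)
    if cursor < width:
        spans.append((cursor, width))
    return spans


def keep(wells, span, bottom):
    """Record a well; for equal X coordinates only the lowest bottom survives."""
    wells[span] = min(bottom, wells.get(span, bottom))


def flow_scan(tasks, width, height):
    """Return maximum free rectangles as (left, right, bottom, top) tuples."""
    wells = {}   # (left, right) -> bottom, standing in for the RWLL
    found = []
    for span in free_spans(tasks, 0, width):   # FPGA bottom acts as an out-edge
        keep(wells, span, 0)
    heights = sorted({t.bottom for t in tasks} | {t.top for t in tasks} | {height})
    for y in heights:
        # In-edge processing: the FPGA top is an in-edge over the full width.
        in_edges = [(0, width)] if y == height else [
            (t.left, t.right) for t in tasks if t.bottom == y]
        if in_edges:
            remaining = {}
            for (left, right), bottom in wells.items():
                hits = sorted(e for e in in_edges if e[0] < right and left < e[1])
                if not hits:
                    keep(remaining, (left, right), bottom)
                    continue
                found.append((left, right, bottom, y))   # close the well
                cursor = left
                for e_left, e_right in hits:             # wells for uncovered parts
                    if e_left > cursor:
                        keep(remaining, (cursor, e_left), bottom)
                    cursor = max(cursor, e_right)
                if cursor < right:
                    keep(remaining, (cursor, right), bottom)
            wells = remaining
        # Out-edge processing: open a well over the free span above the edge.
        if y < height:
            for t in tasks:
                if t.top == y:
                    for left, right in free_spans(tasks, y, width):
                        if left < t.right and t.left < right:
                            keep(wells, (left, right), y)
    return found


EXAMPLE = [Task(10, 25, 20, 50), Task(50, 70, 60, 85)]  # inferred positions
```

Calling flow_scan(EXAMPLE, 100, 100) yields the same eight rectangles as the walk-through, although in a different order. The keep helper mirrors the rule that, among RWs with the same X coordinates, only the FRW is recorded.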
Operations on Linked lists
Two types of linked list operations are used in our algorithm: the linked list update and the linked list search. The search operation checks all recorded edge nodes and finds all maximum free rectangles existing on the FPGA. The algorithm starts by searching the GELL. When checking a node in the GELL, the algorithm reads its edge counter values. Thereafter, the algorithm searches the OELL and/or the IELL according to the values of the counters. When searching the OELL and IELL, the FRWs are created and the maximum free rectangles are found as described in section 3.3.
The linked list update operation adds edge nodes to (or deletes them from) the lists and adjusts the edge counters to the correct values. All edge nodes are kept in increasing order of height. When a new task arrives, two edge nodes representing its in-edge and out-edge are created and added to the IELL and OELL respectively. Next, if the GELL already has a node with the same height as one of the edges, the corresponding counter is incremented by 1. Otherwise, a new node for that height is added at the correct position. When a task completes its computation, the two related edge nodes in the OELL and IELL are removed, and the edge counters in the related GELL nodes are adjusted using the pointers stored in the OELL and IELL nodes. If both edge counters of a GELL node equal 0, the node is removed from the list. Otherwise, other edges are still present at this height.
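The update operation can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: sorted Python lists stand in for the linked lists, a GELL node is looked up by height where the paper follows the stored pointer, and the class name, method names and field layout are hypothetical.

```python
import bisect

N_IN, N_OUT = 1, 2   # positions of the two edge counters inside a GELL node


class EdgeLists:
    """Height-ordered stand-ins for the GELL, IELL and OELL."""

    def __init__(self):
        self.gell = []   # nodes [height, in_edge_count, out_edge_count]
        self.iell = []   # nodes (height, left, right, expire_time)
        self.oell = []   # nodes (height, left, right, expire_time)

    def _gell_node(self, height):
        """Return the GELL node for this height, creating it if needed."""
        i = bisect.bisect_left(self.gell, [height])
        if i == len(self.gell) or self.gell[i][0] != height:
            self.gell.insert(i, [height, 0, 0])
        return self.gell[i]

    def add_task(self, left, right, bottom, top, expire):
        """Insert the in-edge and out-edge of a newly placed task."""
        bisect.insort(self.iell, (bottom, left, right, expire))
        bisect.insort(self.oell, (top, left, right, expire))
        self._gell_node(bottom)[N_IN] += 1
        self._gell_node(top)[N_OUT] += 1

    def remove_task(self, left, right, bottom, top, expire):
        """Delete both edges of a finished task and adjust the counters."""
        self.iell.remove((bottom, left, right, expire))
        self.oell.remove((top, left, right, expire))
        for height, counter in ((bottom, N_IN), (top, N_OUT)):
            node = self._gell_node(height)
            node[counter] -= 1
            if node[N_IN] == 0 and node[N_OUT] == 0:
                self.gell.remove(node)   # no edge left at this height
```

For example, after adding tasks with edges at heights (20, 50), (60, 85) and (50, 60), the GELL stand-in holds four nodes, and the nodes at heights 50 and 60 each count one in-edge and one out-edge.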
Proof of completeness
In this section, we prove two theorems which guarantee that the FS algorithm finds the complete set of true maximum rectangles.
Theorem 1: Generated FRWs always start at an out-edge height and the set of FRWs created on any edge i is complete and correct.
Assume that the bottom of an FRW is not positioned at the height of any out-edge. Then this FRW can still be expanded downwards until it reaches an out-edge. This contradicts the definition of the FRW (an FRW can only be expanded upwards). This proves that all FRWs start from an out-edge.
Assume there is a missing or incorrect FRW created when our algorithm scans edge i. (A) In case of an out-edge (as shown in figure 4(a)), L and R are the left side and right side of the free space at out-edge i. In the out-edge processing, one FRW is created with X dimensions equal to L and R, so it covers the whole free span at that edge. (B) In case of an in-edge (e.g. figure 4(b)), a missing or incorrect FRW means that FRWi is absent or that its width span does not cover the full non-overlapping area. The bottom of FRWi is on the out-edge of Taskj. As described for the in-edge processing, the RWs are created for the non-overlapping area from the FRWs (FRWj in our example) overlapped with the in-edge (in-edgei). If FRWi is missing, this implies that the FRWj created from out-edge j is incorrect. More precisely, it does not contain the area that FRWi occupies (or a part of that area when FRWi is not correctly created). This contradicts the result for FRWs generated at an out-edge presented above, so the assumption is wrong.
Theorem 2:
There is a one-to-one relationship between the set of FRWs and the complete set of maximum free rectangles. Assume two FRWs have an overlapping area, e.g. FRW and FRWm as shown in figure 4(a). This means that one of them can be expanded horizontally (FRWm in this example), contradicting the FRW definition. So no two FRWs can overlap, which proves that each FRW is unique.
Assume a maximum free rectangle R is not generated from any FRW. Then this rectangle can still be expanded, as shown in figure 4(c). This contradicts the definition of the maximum free rectangle presented earlier. This proves that any maximum free rectangle is generated from an FRW. Assume there is an FRW that does not become a maximum free rectangle after the scanning process. This means that there is no in-edge overlapping the area above this FRW. This contradicts the fact that the top border of the FPGA area is defined as the highest in-edge, with a width span equal to the FPGA area width. This implies that all FRWs become maximum free rectangles once the scanning flow is completed.
Figure 4. Contradiction situation
Overall, the second theorem establishes the one-to-one relationship between FRWs and maximum free rectangles. So if the set of FRWs is complete and correct, the whole set of maximum free rectangles is found completely and correctly. The first theorem guarantees that the complete and correct set of FRWs is created. Therefore, the FS algorithm finds the whole set of maximum free rectangles completely and correctly.
Experimental evaluation
We performed simulations to evaluate the performance of our algorithm (FS), the staircase algorithm [3] and the enhanced SLA (eSLA) algorithm [2]. The three algorithms were implemented in C and evaluated under Linux 2.6 running on an Intel Pentium(R) 4 CPU at 3.00 GHz with 2 GB of main memory.
In each simulation run, 10000 tasks were generated randomly. We integrated the three scanning algorithms into the same simple on-line placement algorithm. This placement algorithm uses a first fit policy to find a suitable allocation for the arriving tasks from the set of maximum free rectangles generated by the scanning algorithm. Before each simulation, a tracing process is performed using the placement algorithm equipped with one of the scanning approaches. We saved the tracing output, which contains the partitioning information during placement, in a trace to ensure that the three scanning algorithms used exactly the same partitioning during execution. This trace was generated as follows: we ran the placement algorithm 10000 times and stored a single task into the trace at each run. We start from the complete 2D FPGA area (100x100 configurable units (CUs)) and create each new task from the output of the previous scanning algorithm execution, which corresponds to the current complete set of maximum free rectangles (the initial maximum free rectangle is the complete FPGA area). One of the maximum free rectangles is selected randomly. The size of the new task is randomly generated within the selected maximum free rectangle. For the arrival time, each task is assigned a random number in [5..25] time units. For the task life time, three ranges were used: T250, T500, and T1000. For T250 the task life time is randomly chosen from the interval [5..250], for T500 from [251..500], and for T1000 from [501..1000]. All the above information about the partitioning and the new tasks is saved in our trace. We used the generated trace to evaluate the scan time of the three algorithms. Please note that the choice of algorithm used to generate the trace is irrelevant for our study, because all maximum free rectangles are sorted in the same order and first fit is used.
Please also note that the FS algorithm aims to find the complete set of maximum free rectangles on the FPGA at runtime; it is not an on-line placement algorithm.
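The first fit policy and the task generation used for the trace can be sketched in Python as follows. This is a rough illustration only. It assumes that a task is placed at the bottom-left corner of the chosen rectangle, that the task size is drawn uniformly in whole CUs, and that the number in [5..25] is stored as given, without deciding whether it is an absolute time or an interval. None of these details are stated in the text, and all names are hypothetical.

```python
import random

# Life time ranges of the three task sets, in time units.
LIFETIME_RANGES = {"T250": (5, 250), "T500": (251, 500), "T1000": (501, 1000)}


def first_fit(rects, task_width, task_height):
    """Return the placement in the first rectangle large enough, or None.

    Rectangles and placements are (left, right, bottom, top) tuples.
    Placing at the bottom-left corner is an assumption of this sketch.
    """
    for left, right, bottom, top in rects:
        if right - left >= task_width and top - bottom >= task_height:
            return (left, left + task_width, bottom, bottom + task_height)
    return None


def random_task(rects, task_set, rng):
    """Draw one task whose size fits a randomly selected maximum free rectangle."""
    left, right, bottom, top = rng.choice(rects)
    low, high = LIFETIME_RANGES[task_set]
    return {
        "width": rng.randint(1, right - left),
        "height": rng.randint(1, top - bottom),
        "arrival": rng.randint(5, 25),
        "lifetime": rng.randint(low, high),
    }
```

With the eight rectangles of the figure 3 example in the order in which they are listed in the text, first_fit(rects, 60, 60) selects (25, 100, 0, 60) and returns the placement (25, 85, 0, 60), while a request of 80 by 80 CUs is rejected.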
Execution time
The algorithm is executed every time a new task arrives or a task is removed. In our simulation with 10000 tasks, the three algorithms are invoked approximately 15000 times. Figure 5 presents the average execution time of a single algorithm call and the execution time distribution. As shown in figure 5(a), our algorithm has the shortest execution time of the three algorithms for all three task sets. The eSLA has the longest execution time in all simulations. The reason is that, in both the eSLA and the staircase algorithms, the information stored in a larger number of CUs has to be accessed in order to find all maximum free rectangles. In addition, during the update process they have to adjust the information in all related CUs. In our algorithm, only the task edges are processed, which are far fewer. The worst case for our algorithm occurs when all edges of n placed tasks are located at different heights, implying that 2n nodes have to be accessed. This makes the worst-case complexity of our algorithm O(2n), i.e. linear in the number of placed tasks. On average, our algorithm is 1.5x faster than the staircase algorithm and 5x faster than eSLA. Figure 5(b) shows the execution time distributions. The highest point of the curve representing the FS algorithm indicates that 50% of the algorithm calls (around 7500) complete in the time interval between 20µs and 40µs. For short task life times (figure 5(b)), FS has execution times clearly concentrated on the left side of the graph. This is because, with short task life times, few tasks are present on the FPGA and the total number of edges to be processed by the FS algorithm is small. For all three ranges of life times, the density of the FS algorithm samples is higher in the shorter time periods than for the other two algorithms, similar to the situation shown in figure 5(b).
Scanning load
In the staircase and the eSLA algorithms, all CUs are encoded. The algorithms use the encoded information to find the maximum rectangles. In our algorithm, we use linked lists to record information and to find maximum free rectangles. The scanning load is defined as the number of CUs (or linked list nodes) that have to be accessed during algorithm execution. As shown in table 1, for each update the staircase algorithm modifies at least 191 CUs and the eSLA at least 160 CUs. In our algorithm, in the worst case only 3 nodes are added to (or deleted from) the GELL, OELL and IELL respectively. During the scanning process (looking for the complete set of maximum free rectangles), the number of nodes that need to be checked by our FS algorithm is much lower than the number of CUs that need to be visited by the other two algorithms, as shown in the last column of table 1. The large number of CUs for the eSLA algorithm arises because many CUs are checked several times for different rectangles in the same scan iteration.
Conclusion and future work
In this paper, we proposed a new algorithm for finding the complete set of maximum free rectangles during on-line FPGA placement. Our experimental results have shown that the Flow Scan algorithm performs better than state-of-the-art algorithms providing the same functionality. In the future, our work will focus on: (i) integrating the Flow Scan algorithm into previously proposed on-line placement techniques; (ii) applying the FS algorithm in task scheduling schemes.
