Partial reconfiguration (PR) of FPGAs can be used to dynamically extend and adapt the functionality of computing systems by swapping hardware (HW) tasks in and out. To coordinate on-demand task execution, we propose and implement a Run-Time System Manager (RTSM) that schedules software (SW) tasks on the available processor(s) and HW tasks on any number of reconfigurable regions (RRs) of a partially reconfigurable FPGA. Fed with the initial partitioning of the application into tasks, the corresponding task graph, and the available task mappings, the RTSM controls system operation by considering the status of each task and region (e.g. busy, idle, scheduled for reconfiguration/execution). Our RTSM supports task reuse and configuration prefetching to minimize reconfigurations, task movement among regions to manage the FPGA area efficiently, and region reservation for future reconfiguration and execution. We validate the correctness and portability of our RTSM by executing an image processing application on two Xilinx-based platforms: ZedBoard and XUPV5. We also perform a more extensive evaluation of its features using a simulation framework, and find that, despite the technology limitations, our approach can give promising results in terms of scheduling quality. Since our RTSM also supports the scheduling of parallel SW tasks, we use it to manage the execution of the entire parallel Edge Detection application on a desktop; we compare the application execution time with that obtained using the OpenMP framework and find that execution with our RTSM is 2.4 times faster than with the unoptimized OpenMP version. When processor affinity optimization is enabled for OpenMP, our RTSM and OpenMP are on par, indicating that the scheduling efficiency of our RTSM is competitive with this state-of-the-art scheduler, while additionally supporting the management of HW tasks.
Introduction
Reconfiguration offers the possibility to dynamically adapt the functionality of hardware systems by swapping HW tasks in and out. To coordinate resource management, the loading and triggering of HW task reconfiguration, and task execution in partially reconfigurable FPGA-based systems, efficient and flexible runtime system support is needed [1]. To this end, several scheduling algorithms of various complexities have been proposed [2]. In this paper we propose and implement a Run-Time System Manager (RTSM) whose scheduling mechanisms efficiently manage the execution of HW and SW tasks and the use of physical resources. We aim to execute a given application as fast as possible without exhausting the physical resources. Our motivation during the development of our RTSM was to find ways to design a versatile system under the strict technology restrictions imposed by the Xilinx PR flow and devices [3]:
• Static partitioning of the reconfigurable surface in reconfigurable regions (RRs).
• Reconfigurable regions can only accommodate particular hardware core(s), called reconfigurable modules (RMs). The RM-RR binding takes place at compile-time, after the RR has been properly sized and shaped.
• An RR can hold one RM only at any point of time, so a second RM cannot be configured into the same RR even if there are enough free logic resources for it.
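The three restrictions above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not part of the RTSM implementation: the class name `ReconfigurableRegion`, its fields, and the module names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ReconfigurableRegion:
    """Hypothetical model of one statically partitioned RR."""
    name: str
    # RM-RR binding: fixed at compile-time, never extended at run-time.
    compatible_modules: Set[str]
    # At most one RM is resident in an RR at any point in time.
    loaded: Optional[str] = None

    def can_host(self, module: str) -> bool:
        """True only if the module was bound to this RR at compile-time."""
        return module in self.compatible_modules

    def configure(self, module: str) -> Optional[str]:
        """Load a module, replacing the resident one; return the evicted RM."""
        if not self.can_host(module):
            raise ValueError(f"{module} was not mapped to {self.name}")
        evicted = self.loaded
        self.loaded = module  # the new RM replaces the old one; they never coexist
        return evicted


rr0 = ReconfigurableRegion("RR0", {"sobel", "blur"})
rr0.configure("sobel")
evicted = rr0.configure("blur")  # "sobel" is swapped out, even if logic is free
```

In this model, loading a second module always evicts the first, which mirrors the third restriction: free logic inside a region cannot be used for an additional RM.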
The proposed RTSM can run on Linux Intel-x86 based systems with a PCIe FPGA board, e.g. XUPV5, or on embedded processors (MicroBlaze or ARM) within the FPGA, and it can be ported to other systems with different processors and FPGAs. Furthermore, with the appropriate changes it can also run on Linux-based systems without an FPGA, in order to manage the available CPU cores only.
We validated the behavior of the RTSM in three different fully functional systems: a ZedBoard Zynq SoC-based platform, an XUPV5 platform, and a desktop PC with a 12-core Intel Xeon E5 processor at 2.2 GHz. This also allowed us to assess the RTSM on three different CPU technologies: ARM, MicroBlaze, and Intel-x86. We also evaluate the RTSM extensively with complex cases within a simulation framework that observes all the restrictions of partial reconfiguration technology. The present work extends our previous publication [4] and makes the following contributions:
• A portable RTSM capable of scheduling both HW and SW tasks in PR FPGA-based systems and SW-only tasks with comparable results to the OpenMP API.
• Support for dynamic execution of complex task graphs, with forks, joins, loops and branches so as to minimize the restrictions on the application.
• Support for multiple scheduling policies, such as relocation, reuse, configuration prefetching, reservation and Best Fit.
• Extensive evaluation of the proposed RTSM on three different systems and a comparison with the industry-standard OpenMP API in controlling SW-only tasks.
The paper is structured as follows. In Section 2 we discuss previous work in the field. In Section 3 we present the key concepts and provide details on the RTSM input and operation. Then, in Section 4 we evaluate the performance of the RTSM in a simulation environment with complex test cases, and in Section 5 we extend our evaluation by validating the RTSM on real FPGA-based platforms. Finally, Section 6 summarizes our work.
Related work
There is an increasing interest in exploiting the advantages of partial dynamic reconfiguration over full reconfiguration. The work in [5] was one of the first to study the use of partial reconfiguration in the high-performance computing domain. In one of the first research works on hardware task scheduling for PR FPGAs, Steiger et al. addressed the problem for the 1D and 2D area models by proposing two heuristics: Horizon and Stuffing [6]. In [7], Marconi et al., inspired by [6], presented a novel 3D total contiguous surface heuristic in order to equip the scheduler with a "blocking-awareness" capability. Subsequently, Lu et al. created a scheduling algorithm that considers the data dependencies and communication among hardware tasks, and between tasks and external devices [8].
Efficient placement and free space management algorithms are equally important. In A work that did not consider Compton's paradigm and therefore followed the strict FPGA technology restrictions regarding partial reconfiguration is [12]. There the authors present a novel reduced data movement scheduling (RDMS) algorithm that takes data dependency among tasks, hardware task resource utilization, and inter-task communication into account during the scheduling process. However, their algorithm was not tested on an actual FPGA and was evaluated by means of emulation. The authors in [13] studied an approach for hardware task placement and space management, focusing on a resource- and configuration-aware floorplacement framework that uses an objective function based on external wirelength. This work targets scheduling at compile-time.
Burns et al., in one of the first efforts to create an operating system (OS) for partially reconfigurable devices, extracted the common requirements for three different applications, and designed a runtime system for managing the dynamic reconfiguration In the field of programming models that target multi-core heterogeneous architectures, the OmpSs model presented in [17] had a great impact. Apart from multi-core architectures, OmpSs can incorporate the use of OpenCL and CUDA kernels. OmpSs differs from similar programming models like OpenMP and MPI in that it does not adopt a fork-join model. Instead, OmpSs has a thread pool where all the threads to be used throughout the application are present from the beginning.
The run-time management of hardware tasks in partially reconfigurable devices is an interesting and very active research topic [18]. The OpenPR toolchain [19] and the GoAhead framework [20] provide a solid base for further research into partial reconfiguration and reconfigurable run-time systems. Many previous efforts have also evaluated scheduling and placement algorithms on actual FPGA-based systems [21, 14].
What seems to be missing are complete solutions that take into consideration all current technology restrictions. In [14], the actual overhead of the scheduler compared to the execution time of each task is not calculated, the reported reconfiguration time is the theoretical one, and the application execution is presented only in a theoretical way. The run-time manager presented in [21] is able to map multiple applications on the underlying PR hardware and execute them concurrently, and it takes all restrictions into consideration; however, the mechanics of the scheduling algorithm are simple and the overhead is considerable.
The Run-time System Manager
Our proposed RTSM manages physical resources by employing placement and scheduling algorithms to select the appropriate hardware processing element (HW-PE), i.e. a Reconfigurable Region (RR), on which to load and execute a particular HW task, or to activate a software processing element (SW-PE) for executing the SW version of a task. HW tasks are implemented as Reconfigurable Modules, which are stored in a bitstream repository.
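The choice between a HW-PE and a SW-PE can be pictured with the following Python sketch. It is a deliberately simplified illustration and not the scheduling algorithm of the RTSM: the preference order (reuse an idle region that already holds the module, else reconfigure an idle compatible region, else fall back to software) is an assumption, and the names `Region` and `select_pe` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass
class Region:
    """Hypothetical run-time view of one RR (a HW-PE)."""
    name: str
    compatible: FrozenSet[str]    # modules mapped to this RR at compile-time
    loaded: Optional[str] = None  # module currently configured, if any
    busy: bool = False            # True while executing or being reconfigured


def select_pe(task: str, regions: List[Region],
              has_sw_version: bool, sw_pe_free: bool) -> Optional[Tuple[str, str]]:
    """Return (action, processing element) for a ready task, or None to wait."""
    idle = [r for r in regions if not r.busy and task in r.compatible]
    # 1. Reuse: an idle region already holds the module, no reconfiguration needed.
    for r in idle:
        if r.loaded == task:
            return ("reuse", r.name)
    # 2. Otherwise reconfigure the first idle region the task was mapped to.
    if idle:
        return ("reconfigure", idle[0].name)
    # 3. Otherwise run the SW version, if one exists and a processor is free.
    if has_sw_version and sw_pe_free:
        return ("software", "SW-PE")
    return None  # no processing element available right now


regions = [
    Region("RR0", frozenset({"sobel", "blur"}), loaded="blur"),
    Region("RR1", frozenset({"sobel"}), loaded="sobel"),
]
decision = select_pe("sobel", regions, has_sw_version=True, sw_pe_free=True)
```

With the two regions above, the sketch picks RR1 for the task, because that region already holds the requested module and so no reconfiguration is needed.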
Key concepts and functionality
During initialization, the RTSM is fed with input, which forms the basic guidelines according to which the RTSM takes runtime decisions:
(1) Device pre-partitioning and Task mapping: The designer should pre-partition the reconfigurable surface at compile-time, and implement each HW task by mapping it to certain RR(s) [3]. This limitation was discussed in [1, 21].
(2) Task graph: The RTSM should be aware of the execution order of tasks and their dependencies; this is provided with a task graph. Our RTSM supports complex graphs with properties like forks and joins, branches and loops, for which the number of iterations is unknown at compile-time.
(3) Task information: Execution time of SW and HW tasks, and reconfiguration time of HW tasks should be known to the RTSM; they can be measured at compile-time through profiling. A task's execution time might deviate from the estimated or profiled execution time so the RTSM should react adapting its scheduling
