Real-time performance is the primary requirement for edge computing systems. However, with the surge in data volume and the growing demand for computing power, a computing framework consisting solely of CPUs is no longer adequate. As a result, heterogeneous architectures that pair CPUs with accelerators to improve the computing capacity of edge computing systems have received great attention. The type of accelerator largely determines the performance of the edge computing system. Accelerators include the Graphics Processing Unit (GPU), Application Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA). FPGAs, with their reconfigurability and high energy efficiency, are widely used in many edge computing scenarios. Nonetheless, performance also depends on the scheduling efficiency between software tasks on CPUs and hardware tasks on FPGAs. Unfortunately, existing strategies have not fully exploited the differences between hardware and software tasks, resulting in low scheduling efficiency. This paper proposes a task scheduling framework on the Dynamic Partial Reconfiguration (DPR) platform. We take full account of the task switching overhead and the predictable execution time of hardware tasks in DPR, and reduce both the number of task switches and the number of active tasks in the system, thus improving scheduling efficiency. We conduct a set of experiments on the Zynq platform to verify the proposed framework. Experimental results demonstrate that when the execution time of the accelerator exceeds the reconfiguration cost by an order of magnitude, the efficiency in all cases is above 98%, and that the efficiency reaches 90%-98% when the two are of the same order of magnitude.
I. INTRODUCTION
With the surge in data volume, the growing demand for computing power and the increasing concern about safety, edge computing has received much attention in recent years. The primary requirement of edge computing is real-time performance. However, the computing power required to handle rapidly increasing data is already growing faster than the rate predicted by Moore's Law, so a computing framework consisting solely of CPUs can no longer meet the demand. As a result, heterogeneous platforms that combine CPUs with GPUs, ASICs or FPGAs as accelerators to jointly process massive data have received great attention. Figure 1 shows the characteristics of the three architectures with a GPU, an ASIC and an FPGA as the accelerator, respectively. Compared to the GPU in Architecture 1, the FPGA accelerator has lower latency and power consumption. It can also implement computational structures, such as encryption and decryption, that the GPU is not good at. Compared to the ASICs in Architecture 2, FPGAs are much cheaper. Most importantly, FPGAs can be reconfigured dynamically. Moreover, FPGAs offer good computing power for parallel processing, which better meets the data processing speed and real-time requirements of edge computing. As a result, thanks to their low power consumption, low latency, reconfigurability and parallelism, CPU+FPGA heterogeneous platforms are widely used in many edge computing scenarios. For instance, FPGAs have been widely used in 5G wireless networks [1]. They fit the demanding requirements of 5G wireless networks on power consumption, flexibility and intelligence very well.
The associate editor coordinating the review of this manuscript and approving it for publication was Honghao Gao.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Z. Zhu et al.: Hardware and Software Task-Scheduling Framework Based on CPU+FPGA Heterogeneous Architecture
FPGAs can also be applied to autonomous driving, which has high requirements for real-time performance, to accelerate data computing [2]. The most important advantage of FPGAs is dynamic reconfiguration. DPR-based FPGAs enable the user to reconfigure a portion of the FPGA dynamically at runtime while the remainder of the device continues running. This is especially valuable for embedded systems with strict resource and power constraints. It also opens a new scheduling dimension for optimizing applications running on heterogeneous platforms with FPGAs as accelerators [3]. However, DPR has its pros and cons. The reconfiguration time is about three orders of magnitude higher than the context switch time in multitasking. To address this problem, Liu et al. [4] and Duhem et al. [5] design reconfigurable peripheral interfaces that achieve throughputs of 400 MB/s and 800 MB/s, respectively. We therefore believe that the reconfiguration time of DPR will continue to decline in the future, making DPR a research hotspot. At present, however, the reconfiguration time is still not negligible. The huge cost of frequent reconfiguration during scheduling must be effectively reduced or hidden; otherwise the performance and scheduling efficiency of the system will be seriously affected.
During execution, if a required computing unit is not currently available, the scheduler reconfigures the FPGA to generate it. At this point, the scheduler has to handle both the tasks on the general-purpose processors (GPPs) and the processes that dynamically reconfigure the FPGA. The scheduling efficiency of the system suffers if the interaction between these two groups of processes is not handled well. Therefore, there is a pressing need for an efficient scheduler for dynamically reconfigurable platforms. To solve the above problems, many previous studies abstract tasks on CPUs and FPGAs as software and hardware tasks, respectively, and propose novel scheduling models accordingly. However, these works do not make full use of the characteristics of DPR and deliver undesirable performance. Andrews et al. [6] propose the concept of the hardware thread on FPGAs. It supports unified scheduling of software and hardware tasks, as well as communication between hardware and software threads through synchronization and mutual exclusion primitives. The works in [7]-[10] further enrich and exploit the concept of hardware threads. For instance, ReconOS [8] extends the multithreaded programming model from the software domain to the hybrid CPU/FPGA platform, which enables users to write multithreaded programs that mix software and hardware threads.
However, the studies mentioned above are based on static reconfiguration and do not explore the DPR scenario. If the scheduling process does not fully consider the hardware task switching overhead in DPR, it may cause frequent system reconfiguration, thus increasing the scheduling overhead. Moreover, hardware tasks executed on FPGAs are usually cycle-accurate. Once issued, a hardware task should not be rescheduled until its execution finishes; that is, it cannot be preempted by other hardware tasks. This is a significant difference between hardware tasks and software tasks, yet existing work has not made full use of it.
To solve the problems above, we propose a framework that fully exploits the hardware task switching overhead and the predictable execution time of hardware tasks in DPR, and reduces both the task-switching frequency and the number of active tasks in the system, thus improving scheduling efficiency. The proposed framework can also benefit the cycle-accurate parts of other types of tasks. For example, I/O-intensive tasks can be divided into cycle-accurate parts and cycle-inaccurate parts. Our framework can optimize the scheduling of the cycle-accurate parts, thus contributing substantially to the overall performance. The contributions of this paper can be summarized as follows:
• At the operating system (OS) level, this paper proposes a novel framework for task scheduling on DPR that reduces the number and overhead of context switches. It remedies the defect that the task scheduler of a general OS repeatedly performs task switching in DPR without considering the characteristics of hardware tasks. The framework has good scalability.
• This paper proposes a scheduling method that thoroughly exploits the hardware task switching overhead and the predictable execution time of hardware tasks in DPR to improve the scheduling efficiency of the whole system.
• The proposed scheduling method fully utilizes the cycle accuracy and non-preemption of hardware tasks to reduce the number and overhead of task context switches, thus improving scheduling efficiency. In our proposed framework, software threads are preemptive, while hardware threads are non-preemptive, and the task scheduling process cannot affect the task running on the FPGA.
• A wide range of experiments is conducted on a real Zynq platform to verify the effectiveness of the proposed framework. When the accelerator's execution time exceeds the reconfiguration cost by an order of magnitude, the efficiency in all cases is above 98%, and the efficiency reaches 90%-98% when the two are of the same order of magnitude.
The rest of this paper is organized as follows. In Section 2 we present our framework and system model, and then introduce the scheduling method. Experimental settings and result analysis are detailed in Section 3. Section 4 summarizes the related work, and finally Section 5 concludes the paper.
II. BACKGROUND
Heterogeneous platforms in edge computing typically consist of two kinds of computing resources: GPPs and reconfigurable hardware. In most cases, the GPP is a CPU and the reconfigurable hardware is an FPGA.
A. ARCHITECTURE COUPLING
GPPs and reconfigurable hardware are coupled in the following ways [11]:
1) Reconfigurable units act as functional units inside the host processor. As Architecture 1 in Figure 2 shows, the FPGA is inside the CPU. This allows for a traditional programming environment with the addition of custom instructions that may change over time. Here, the reconfigurable units execute as functional units on the main microprocessor data path, with registers used to hold the input and output operands. However, the reconfigurable functional unit must communicate with the host processor every time a configurable "instruction" is used.
2) Reconfigurable units act as a coprocessor or an additional processor in a multi-core heterogeneous system. As Architecture 2 (a) in Figure 2 shows, the host processor's data cache is invisible to its attached reconfigurable processing unit, so communication between the host processor and the reconfigurable hardware incurs a higher delay. In return, a large amount of computation can be transferred to the reconfigurable hardware, which guarantees a high level of computational independence. The coprocessor is typically larger than the functional unit in Architecture 2 (b) and can perform calculations without constant supervision by the host processor. This coupling method allows the reconfigurable logic to operate for a large number of cycles without intervention by the host processor, and typically allows the host processor and the reconfigurable logic to execute simultaneously.
3) Reconfigurable units are used as individual external processing units. As Architecture 3 in Figure 2 shows, this allows for greater parallelism in program execution but suffers from higher communication overhead. This model is therefore used when processing may proceed for very long periods without a great deal of communication.
The task scheduling framework proposed in this paper is based on the third coupling method, in which the reconfigurable hardware is used as an external independent processing unit, but it can also be extended to support all the ways in which GPPs and reconfigurable hardware are coupled.
B. RECONFIGURATION CLASSIFICATION
Reconfigurable technologies can be classified according to different criteria, mainly the following two:
1) According to the reconfiguration process, they can be classified as static or dynamic reconfiguration. In static reconfiguration, the reconfigurable hardware can only be reconfigured during the initialization period of the FPGA, while in dynamic reconfiguration its functions can be changed dynamically.
2) According to the reprogrammable area in the reconfigurable hardware, they can be classified as global or partial reconfiguration. In global reconfiguration, all the resources of the reconfigurable hardware must be reprogrammed, while in partial reconfiguration only part of them is reprogrammed, leaving the remaining part unchanged. Partial reconfiguration reduces the number of units to be reconfigured, thus shortening the reconfiguration time.
Figure 3 illustrates an example of static global reconfiguration. There are two AES Encoder accelerators in the FPGA during the initialization period. The left one needs to be reconfigured to an AES Decoder during runtime. At this point, the OS stops sending tasks to the entire FPGA, including the AES Encoder that does not need to be reconfigured, until the entire FPGA is idle. After task execution on both AES Encoder accelerators has completed, the OS calls the driver to handle the FPGA reconfiguration. The OS resumes sending tasks to the FPGA after the reconfiguration is complete. The design of the OS in this approach is simple, since it only needs to treat the FPGA as a whole. However, this can waste the computing power of the FPGA. In contrast, DPR can take full advantage of the computing capacity of the FPGA, thus avoiding or reducing the waste of computing resources.
Figure 4 shows an example of DPR. Each accelerator is abstracted as a black box in DPR, and the functionality of the black box can be changed at runtime. Initially, the FPGA is divided into two black boxes, both of which are AES Encoder accelerators. During operation, black box 0 needs to be reconfigured to an AES Decoder. At this point, the OS stops sending tasks to black box 0, waits until it is idle, and then calls the driver to reconfigure it. After the reconfiguration is completed, the OS resumes sending tasks to black box 0. Black box 1 is unaffected and can keep running during the entire procedure. DPR can maximize the computing power of the FPGA, but the complexity of the OS design increases at the same time, because the management granularity of the OS changes from the entire FPGA to each black box. In addition, the OS must fully utilize the FPGA's computing capacity while minimizing the number of reconfigurations and hiding their overhead, which is also a huge challenge. This challenge is the key issue addressed in this paper.
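The difference between the two examples comes down to which accelerators the OS must stop feeding before the driver is called. The following is a minimal Python sketch of that contrast, not the authors' implementation. The names `BlackBox`, `boxes_to_pause` and `reconfigure` are hypothetical, and the wait-until-idle step and the driver call are represented only by a comment.

```python
from dataclasses import dataclass


@dataclass
class BlackBox:
    function: str                  # function currently loaded, e.g. "AES Enc"
    accepting_tasks: bool = True   # False while the OS withholds new tasks


def boxes_to_pause(mode, target, num_boxes):
    """Return the indices the OS must stop feeding before reconfiguring `target`."""
    if mode == "global":           # the FPGA is treated as a whole
        return list(range(num_boxes))
    if mode == "partial":          # DPR: only the affected black box is paused
        return [target]
    raise ValueError(f"unknown reconfiguration mode: {mode}")


def reconfigure(boxes, target, new_function, mode):
    """Pause the required boxes, change the target's function, then resume."""
    paused = boxes_to_pause(mode, target, len(boxes))
    for i in paused:
        boxes[i].accepting_tasks = False
    # A real OS would wait here until every paused box is idle and then
    # call the reconfiguration driver; this sketch only updates the model.
    boxes[target].function = new_function
    for i in paused:
        boxes[i].accepting_tasks = True
    return paused
```

With two AES Encoder boxes, `reconfigure(boxes, 0, "AES Dec", "partial")` pauses only box 0, whereas the `"global"` mode pauses both, which is the waste of computing power that DPR avoids.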
III. FRAMEWORK AND SYSTEM MODEL
In order to use DPR more conveniently and fully utilize the computing capacity of the DPR platform, a task scheduling framework, as shown in Figure 5 , is proposed in this paper. The framework includes a hardware architecture and a software stack. 
A. HARDWARE ARCHITECTURE
The hardware architecture consists of the following components:
CPU Core. The CPUs are used as GPPs. Tasks that are not suitable for FPGA acceleration run on the CPU cores. Take JPEG encoding as an example. It can be divided into four stages: Color space Conversion (CC), two-dimensional Discrete Cosine Transform (DCT), Quantization (Quant) and Huffman Entropy coding (Huffman). The first three stages have a fixed number of input and output parameters, so they are well suited to running on the FPGA, while the Huffman stage is better suited to the CPU.
FPGA Resources. FPGA resources that support DPR are used as accelerators and are organized into black boxes. The position and size of the black boxes cannot be changed at runtime, but their functions can be reprogrammed at any time. The intellectual property (IP) core Library stores the bitstreams of the accelerators, which are downloaded to the FPGA. Usually, the IP core Library is generated by offline application profiling.
On-chip Interconnects. The on-chip interconnections connect different types of computing resources, such as GPPs and FPGAs.
Reconfiguration Controller. The reconfiguration controller is in charge of the reconfiguration of FPGAs [12] . When a reconfiguration command is obtained from the scheduler of the OS, the reconfiguration controller reprograms a particular part of the FPGA circuit by downloading the bitstream in the IP core library.
Peripheral Bus. The framework in Figure 5 is loosely coupled; that is, the FPGA acts as an external independent computing unit. In this mode, GPPs and FPGAs use the peripheral bus for data transmission. It is worth noting that the peripheral bus is not always necessary. For example, if the FPGA is a part of the GPPs, as Architecture 1 in Figure 2 shows, data can be transferred via the CPU internal bus.
B. SOFTWARE STACK
The software stack is composed as follows:
Resource management. In addition to the task scheduler and drivers of a general OS, the framework includes a resource management module that sends reconfiguration commands to the reconfiguration controller at runtime. The resource management module manages the FPGA in the form of black boxes. The set of black boxes is denoted B, and |B| is the total number of black boxes. Each black box is denoted b_i, i = 0, ..., |B| − 1. Each black box is a carrier of multiple functions and can be dynamically reconfigured to support different functions. The functions available on black box b_i form a set F_i, i = 0, ..., |B| − 1. The union of all F_i, denoted F, is the function set that the FPGA can provide. At each moment during system operation, a black box can be represented by the pair (Position, Function). Position is given by black_box_number, each of which corresponds to a physical address. The OS uses black_box_number for task scheduling, and uses the physical address to manage the reconfiguration process and in the design of the driver. Function indicates the function currently loaded in the black box; it can only be one element of that black box's function set. Figure 6 shows an example of resource management. The resources on the FPGA are divided into 5 black boxes, namely |B| = 5. The resource management module of the OS first labels the Position (0, ..., |B| − 1) of all black boxes and saves the set of functions that each black box can perform. It should be emphasized that the functions that each black box can perform are usually different. In this example, the function set of black box 0 is {AES Enc, AES Dec}, while the function set of black box 1 is {AES Enc, AES Dec, IDCT}. In general, the function sets are determined offline, and the partition of the FPGA into black boxes can be obtained by design space exploration, which is beyond the scope of this paper.
The task scheduler obtains the physical view of the FPGA from the resource management module, and schedules tasks according to their execution status in the user program. The Position and the function set F are used in the scheduling process. When a process in the OS needs to communicate with a black box or to reconfigure it, it should write the corresponding bitstream to the physical address of that black box.
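The bookkeeping described above can be modelled in a few lines. The following Python sketch is illustrative only: the class `BlackBox`, the helper names and the physical addresses are hypothetical, and only the two function sets quoted from Figure 6 are used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlackBox:
    position: int                   # black_box_number used by the scheduler
    physical_address: int           # hypothetical address used by the driver
    function_set: frozenset         # F_i, fixed offline
    function: Optional[str] = None  # function currently loaded, if any


def function_union(boxes):
    """F: the union of all F_i, i.e. everything the FPGA can provide."""
    result = set()
    for box in boxes:
        result |= box.function_set
    return result


def candidates(boxes, function):
    """Positions of the black boxes whose function set contains `function`."""
    return [box.position for box in boxes if function in box.function_set]


def load(box, function):
    """Record a reconfiguration; Function must be an element of F_i."""
    if function not in box.function_set:
        raise ValueError(f"black box {box.position} cannot host {function}")
    box.function = function


# The two black boxes whose function sets are given for Figure 6.
# The addresses are made up for illustration.
boxes = [
    BlackBox(0, 0x40000000, frozenset({"AES Enc", "AES Dec"})),
    BlackBox(1, 0x40010000, frozenset({"AES Enc", "AES Dec", "IDCT"})),
]
```

Here `candidates(boxes, "IDCT")` returns only position 1, which reflects the remark that the function sets of different black boxes usually differ.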
Driver modules. On top of the drivers of a general OS, the framework proposed in this paper also contains a driver for operating the black boxes. The driver consists of two parts. One part reconfigures the black box; it is usually provided by EDA vendors. The other part is the one through which OS threads communicate with the black box; it follows the design of the user space I/O framework. When a software task in the OS communicates with a hardware task running on a black box, the responses have to be handled in kernel mode, whereas the data transmission and subsequent operations can be completed in user mode. The benefits of this mechanism are:
• The number and size of black boxes laid out on the FPGA can be adjusted offline.
• A user-mode driver makes the driver program easier to maintain. It also avoids, as far as possible, OS hangs and panics caused by errors during the operation of the FPGA.
Task scheduling. The OS is the most important part of the proposed framework. Tasks are organized as threads, which are the scheduling units of the OS. In order to support the DPR platform and make full use of the FPGA, the following issues should be considered in the OS:
• The OS is mainly responsible for the resource management of the FPGA, which is coupled with task scheduling.
• As in Hthread [6], threads running on the FPGA are hardware threads, while the others are software threads. Typically, software threads are preemptive and are scheduled according to time slices. However, saving the context of a hardware thread is time consuming and requires extra memory space [13]. Therefore, in our proposed framework, software threads are preemptive, while hardware threads are non-preemptive, and the task scheduling process cannot affect the task running on the FPGA.
• The structure of a hardware thread's context differs from that of a software thread. Typically, the context of a software thread contains the primary register values, whereas the context of a hardware thread may contain more information, such as circuit states. To create a unified scheduling framework, each hardware thread is associated with a software thread, called its delegate thread [14]. All communications related to a hardware thread are handled by its delegate thread.
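The delegate-thread idea can be illustrated with ordinary threads. The sketch below is an analogy in Python, not the framework's kernel code. The class `HardwareTaskStub` stands in for a hardware thread on a black box, and a `threading.Event` plays the role of the completion interrupt; both are assumptions made for illustration.

```python
import threading


class HardwareTaskStub:
    """Stand-in for a hardware thread on a black box (no real FPGA involved)."""

    def __init__(self, a, b):
        self.a = a
        self.b = b
        self.result = None
        self.done = threading.Event()   # models the completion interrupt

    def run(self):
        self.result = self.a + self.b   # behaves like an "Adder" accelerator
        self.done.set()                 # "interrupt": wake the delegate thread


def delegate(task, out):
    """Delegate thread: blocks without polling until the hardware task is done."""
    task.done.wait()
    out.append(task.result)             # receive the data for the user program


def run_with_delegate(a, b):
    task = HardwareTaskStub(a, b)
    out = []
    delegate_thread = threading.Thread(target=delegate, args=(task, out))
    hardware_thread = threading.Thread(target=task.run)
    delegate_thread.start()
    hardware_thread.start()
    hardware_thread.join()
    delegate_thread.join()
    return out[0]
```

The point of the design is visible in `delegate`: while the hardware task runs, the software side is blocked rather than repeatedly scheduled, so the scheduler only has to deal with one ordinary software thread per hardware task.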
Algorithm 1 Scheduling Algorithm
Input: Tasks to be scheduled.
Output: Task execution actions and reconfiguration actions.
if the current task T is a software task then
    generate a software thread and process the next task;
end if
if the current task T is a hardware task then
    generate a delegate thread;
end if
if all black boxes b_i ∈ B are busy then
    min_time = min predict_time of all black boxes b_i ∈ B;
    set delegate thread state to long_sleep;
    sleep_time = min_time − current_time;
else
    select the free black box b_i ∈ B with the smallest last_intr_time;
    reconfigure b_i to the function of the current task;
    update F_i of b_i;
    generate a hardware thread on b_i;
end if
The variables used in Algorithm 1 are listed in Table 1. The time complexity of the scheduling algorithm is O(n), where n is the number of black boxes. As shown in Figure 7, assume that there are two black boxes, b_1 and b_2, and three tasks, T_1, T_2 and T_3, to be executed in the system. The corresponding functions of these tasks are F_1, F_2 and F_3, respectively. Assume that T_1 and T_2 are generated at time t_1, and T_3 is generated at time t_6. The function set of b_1 is {F_1, F_2}, and b_1 is initially empty. The function set of b_2 is {F_3}, and its initial function is F_3.
Figure 7 (a) is an example of task execution in DPR without considering the characteristics of hardware tasks. At time t_1, the OS performs a context switch to schedule task T_1, and at time t_2 it reconfigures the function of b_1 into F_1. The task execution begins at t_3 and ends at t_4, and the data is returned after a context switch. During the execution of T_1, no black box can execute T_2, so T_2 is repeatedly scheduled by the OS, causing context switches each time. At time t_6, task T_3 is generated, but since the OS is scheduling T_2 at this time, T_3 has to wait until time t_7 and can only then be scheduled after a context switch. The execution on black box b_2 then continues until time t_9, and after a context switch, the data is returned at t_10.
Figure 7 (b) shows the execution process under the framework proposed in this paper. The execution of T_1 is the same as in Figure 7 (a). T_2 remains in the long_sleep state after a context switch, and is not rescheduled by the OS until t_5. When T_3 is generated at t_6, it can be scheduled by the OS immediately, so that its execution completes at t_9. The comparison shows that the proposed framework reduces the number of context switches. In the example above, the number of context switches in (b) is three times less than in (a). It also results in more efficient use of the FPGA's computing power: in the example above, T_3 in (b) can be executed earlier than in (a).
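The decision logic of Algorithm 1 can be sketched in Python as follows. This is a simplified model under stated assumptions, not the integrated kernel scheduler. The field names `predict_time` and `last_intr_time` follow the algorithm; everything else is hypothetical. Two additions are assumptions of this sketch rather than parts of Algorithm 1: only black boxes whose function set contains the requested function are considered, as the Figure 7 walk-through implies, and a box that already holds the requested function is not reconfigured again. Reconfiguration time is ignored when predicting the finish time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    function_set: frozenset         # functions this black box can host
    function: Optional[str] = None  # function currently loaded
    busy: bool = False
    predict_time: int = 0           # predicted finish time of the running task
    last_intr_time: int = 0         # time of the last completion interrupt


def schedule(kind, function, boxes, current_time, exec_time=0):
    """Return the action taken for one task, following Algorithm 1."""
    if kind == "software":
        return ("software_thread",)
    # Hardware task: a delegate thread is assumed to exist from here on.
    # Assumption of this sketch: filter by function set.
    usable = [i for i, b in enumerate(boxes) if function in b.function_set]
    if not usable:
        raise ValueError(f"no black box can host {function}")
    free = [i for i in usable if not boxes[i].busy]
    if not free:
        # All usable boxes are busy: the delegate thread goes to long_sleep
        # until the earliest predicted completion.
        min_time = min(boxes[i].predict_time for i in usable)
        return ("long_sleep", min_time - current_time)
    i = min(free, key=lambda k: boxes[k].last_intr_time)
    box = boxes[i]
    reconfigured = box.function != function   # assumption: skip if loaded
    box.function = function
    box.busy = True
    box.predict_time = current_time + exec_time
    return ("run", i, reconfigured)
```

Replaying the Figure 7 scenario with made-up execution times, T_1 is placed on the first box with a reconfiguration, T_2 is put into long_sleep until that box is predicted to finish, and T_3 runs on the second box without reconfiguration because F_3 is already loaded.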
Algorithm 1 can be integrated into a general OS. Software threads can use the original scheduling algorithm, while the OS requires one additional thread state, long_sleep, for processing delegate threads. Thanks to the delegate thread, hardware threads do not need to be scheduled in most cases. Once hardware threads are generated, the OS is unaware of them until they complete execution and signal it through interrupts or other trigger-based communication, thereby reducing the number and overhead of task context switches.
On the one hand, the proposed scheduling method fully utilizes the cycle accuracy and non-preemption of hardware tasks. The hardware task stays in the long_sleep state for a long time, and its delegate thread neither performs context switches nor occupies processor resources. Therefore, the number and overhead of task context switches are reduced, which improves scheduling efficiency. The problem that the DPR platform suffers a huge performance loss during scheduling is also solved. On the other hand, the scheduling method has good scalability. Algorithm 1 is just a naive scheduling strategy. OS designers can design more efficient and optimized scheduling algorithms according to their needs without changing the design framework of the whole system. Therefore, the framework proposed in this paper overcomes the problem that previous methods can only target specific platforms and OSs.
The service-oriented APIs. The programming model provides service-oriented APIs for applications. During programming, users do not need to know the state of the FPGA; that is, they do not need to know which functions the FPGA can perform at a given time while the system is running. They only need to know the functions that the system can provide according to the C library.
C library. The C library of the programming model contains the implementation of the APIs. Any function that can run on the FPGA has two implementations: one executes on the FPGA using the hardware libraries, and the other executes on the GPPs using the custom software libraries. Programmers can therefore decide for themselves whether to run a function on the FPGA.
for i = 0; i < 5; i++ do
    create Adder;
end for
for i = 5; i < 10; i++ do
    create Minus;
end for
for i = 0; i < 10; i++ do
    wait for the threads to execute;
end for
In this example, the user program creates Adder and Minus threads to run on the FPGA. If there is no Adder or Minus accelerator on the FPGA when the thread creation function is called, the OS is responsible for deciding when and how to reconfigure the FPGA. According to the framework we propose and the code above, the execution flow of the whole system is as follows:
Step 1: First, the system engineer needs to select a benchmark to analyze the applications in a certain field according to their characteristics. Then, the hot tasks are obtained, and suitable tasks for the FPGA acceleration are selected, that is, the function set F that the FPGA can complete is determined.
Step 2: The system engineer performs a design space exploration based on the total resources of the FPGA and F to determine the black box set B that the FPGA needs to provide, including the position and size of each black box, and the function set F_i that each black box can perform. Then the IP core Library is generated to store the bitstream corresponding to each function of each black box.
Step 3: According to the result of Step 2, the system engineer writes the C library "ipcore.h" for the platform, as well as the driver for reconfiguring and using the black boxes in the OS.
Step 4: At this point, the user can write programs on the platform. Users obtain the list of acceleration functions and related interfaces that the system provides through "ipcore.h". For example, the user program uses pthread functions to call the functions provided by the system, such as Adder and Minus, which can be accelerated on the FPGA, and waits for the execution of the tasks to end. At this point, the user's work is over.
Step 5: After receiving the Adder and Minus invoked by the user, the OS performs task scheduling to determine the dependency between the tasks and the execution timing of the tasks [15] , [16] , and checks whether a suitable black box can be used as an execution component. If yes, go to Step 7, otherwise go to Step 6.
Step 6: When the scheduler of the OS finds that no accelerator providing the required function exists in the system, or that the accelerators currently on the FPGA are insufficient to exploit its computing performance, the resource management module is called to reconfigure some black boxes. After the reconfiguration is completed, return to Step 5.
Step 7: The hardware task is sent to the FPGA through the driver module. After the execution is completed, the result is returned by the driver. At this point, the execution of the entire program ends.
It should be noted that Step 1 to Step 3 do not need to be repeated frequently; they are usually carried out once for a specific field. This is also an advantage of the DPR technology used in this system over static reconfiguration. In general, the framework proposed in this paper combines the system hardware design, the OS design and the programming model design. The programmer can use the C library provided by the framework to determine the set of functions that the FPGA can perform, while the design of the entire hardware system, task scheduling and FPGA runtime reconfiguration are all transparent to the programmer, which greatly reduces the programming difficulty. By optimizing the task scheduling in the OS and making full use of the cycle accuracy and non-preemption of task execution on the FPGA, the number of reconfigurations and the performance loss can be further reduced, making the system easy to use, efficient and scalable.
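The user-side program of Step 4 can be sketched as follows. This Python version only mirrors the structure of the pseudocode (five Adder threads, five Minus threads, then a wait on all of them). The functions `adder` and `minus` are hypothetical software stand-ins, not the framework's accelerated calls, and the operands are made up.

```python
import threading


def adder(a, b):
    """Hypothetical stand-in for the Adder function offered by the C library."""
    return a + b


def minus(a, b):
    """Hypothetical stand-in for the Minus function offered by the C library."""
    return a - b


def run_user_program():
    results = [None] * 10

    def worker(index, function, a, b):
        results[index] = function(a, b)

    threads = []
    for i in range(0, 5):                  # create Adder
        threads.append(threading.Thread(target=worker, args=(i, adder, i, 1)))
    for i in range(5, 10):                 # create Minus
        threads.append(threading.Thread(target=worker, args=(i, minus, i, 1)))
    for t in threads:
        t.start()
    for t in threads:                      # wait for the threads to execute
        t.join()
    return results
```

Note that nothing in this program names a black box or a bitstream, which is the transparency the framework aims for: whether and when to reconfigure is left entirely to the OS.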
IV. EXPERIMENTAL RESULTS
In order to evaluate the efficiency of the proposed framework, we build a DPR platform on Zedboard with Xilinx Zynq-7000. Zedboard is composed of Processing System (PS) and Programmable Logic (PL). Among them, PS with dual-core ARM Cortex-A9 is used as GPPs, and PL with Xilinx 7 series programmable logic is used as FPGA resources. In the experimental platform, the FPGA is used as an external stand-alone processing unit (Architecture 3 in Figure 2 ). GPPs and FPGA are interconnected through Advanced Microcontroller Bus Architecture (AMBA) bus. The programming model, OS scheduling method, driver, etc. proposed in this paper are run on GPPs. We integrate the proposed scheduling method into Linaro OS, which is based on the Linux kernel and supports DPR.
A. HARDWARE ARCHITECTURE
The hardware layout is shown in Figure 8. Areas A and B contain the GPPs and the FPGA, respectively. The GPPs communicate with the FPGA through the AMBA bus under the AXI4-Lite protocol. In our experiments, we set part of the FPGA as two black boxes, which can be reconfigured separately at runtime, as shown in area C of Figure 8. In theory, the maximum number of black boxes is limited only by the resources of the FPGA, and it is not affected by the framework proposed in this paper.
The two black boxes can be reconfigured to the IP cores Adder and Minus in our experimental settings. The system engineer generates an IP core Library based on the number of black boxes and the set of functions they can perform. The IP core Library contains the IP core bitstreams required to reconfigure the black boxes; each bitstream contains the programming information for the FPGA. Each IP core is stored under a file name of the form {Function}_{Position}.bin. According to the settings of the experimental platform, there are four IP core bitstream files in the IP core Library: Adder_0.bin, Adder_1.bin, Minus_0.bin and Minus_1.bin. While the OS is running, the bitstream files in the IP core Library can be programmed into the FPGA through commands to complete partial reconfiguration, so that the function of one black box can be changed without affecting the normal execution of the other black boxes. In order to shield the end user from the hardware information of the system (such as the current function of each black box and the position of the black box), the system engineer first calculates the union of all the functions that the black boxes can perform, and then implements the corresponding software and hardware interfaces, which are packaged in the Custom Software Library and the Hardware Library in the C Library of the framework's programming model. In this experiment, the SoftAdder( ) and SoftMinus( ) APIs are added to the Custom Software Library, and the HardAdder( ) and HardMinus( ) APIs are added to the Hardware Library. The end user only needs to provide input and output parameters when calling the APIs in the C Library, without providing any hardware information such as the number or position of the black box.
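The naming convention can be made concrete with a small helper. The following Python sketch only reproduces the {Function}_{Position}.bin scheme stated above; the function names `bitstream_name` and `ip_core_library` are hypothetical and do not belong to the framework.

```python
def bitstream_name(function, position):
    """File name of the bitstream that loads `function` into black box `position`."""
    return f"{function}_{position}.bin"


def ip_core_library(function_sets):
    """List every bitstream file needed, given one function set per black box.

    `function_sets[i]` is the set of functions that black box i can host.
    """
    names = []
    for position, functions in enumerate(function_sets):
        for function in functions:
            names.append(bitstream_name(function, position))
    return sorted(names)


# Experimental setting: two black boxes, each able to host Adder and Minus.
experiment = [{"Adder", "Minus"}, {"Adder", "Minus"}]
```

Calling `ip_core_library(experiment)` yields the four file names listed in the text, one bitstream per (function, position) pair, since a partial bitstream is tied to the black box it targets.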
A typical IP core framework is shown in Figure 9. The overall framework of the IP core must conform to the AXI protocol. In addition, because the IP core notifies the GPPs with an interrupt request when the black box finishes execution, an interrupt pin is required, which is connected to the interrupt controller of the GPPs. The IP core comprises four main parts: User_logic, a Finite State Machine (FSM), an Input FIFO and an Output FIFO. User_logic is independent of the AXI protocol and performs the user-defined computation. The FSM, Input FIFO, and Output FIFO provide an interface that satisfies the AXI protocol. The FSM controls the data transfer between User_logic and the FIFOs, and issues an interrupt based on the state of User_logic. In the design process, usually only User_logic changes, while the other parts (collectively called the shell) remain unchanged.
A typical IP core shell has 6 registers. Reg 0 is the interrupt enable signal, which is always set to 1 in our experiments. Reg 1 and Reg 2 serve as the input data buffer, and Reg 5 serves as the output data buffer. Reg 3 specifies the delay of the IP core. The efficiency of the proposed framework is related to the execution time of the IP core in the black box; to test this relationship, the IP core execution time is precisely adjusted through Reg 3. Reg 4 stores the PID of the delegate thread. When an IP core finishes execution, the interrupt handler wakes up the associated delegate thread to receive the data.
The execution flow of the driver is shown in Figure 10. The driver proposed in this paper adopts the user space I/O driver architecture. For portability, data preparation and processing are completed in user mode, and interrupt response is mainly completed in kernel mode. Data copies between user mode and kernel mode can be avoided with zero-copy techniques, which eliminate unnecessary CPU copying. Because most of the driver logic runs in user mode, OS hangs and panics caused by errors in FPGA operation are avoided as far as possible, and the robustness of the system is improved. The end user can create a hardware task. After being scheduled by the OS, the delegate thread of the hardware task enters the driver. Then, after the hardware task is executed, the delegate thread receives the data and returns to the user program. The main actions performed by the driver are shown by the dashed box in Figure 10, including:
• Preparing data in user mode.
• Switching to kernel mode and writing 1 to Reg 0 to enable interrupts, writing the PID of the delegate thread to Reg 4, and writing the required number of delay cycles to Reg 3, which completes the preparation before data transmission. Writing the input data to Reg 1 and Reg 2 then completes the data transmission.
• The IP core generates an interrupt after execution, and the driver responds to it through the interrupt handler.
• The interrupt handler reads the output data from Reg 5.
• Switching back to user mode and completing data processing.
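The register sequence listed above can be traced with a small software model. This is only a sketch under stated assumptions: `MockAdderCore` and `hard_adder_sketch` are hypothetical names, the IP core is modelled synchronously in Python rather than as FPGA logic, and the real driver would be written against the user space I/O interface rather than a Python object. Only the register roles (Reg 0 to Reg 5) are taken from the text.

```python
# Register indices following the roles described in the text.
REG_IRQ_ENABLE, REG_IN_A, REG_IN_B, REG_DELAY, REG_PID, REG_OUT = range(6)


class MockAdderCore:
    """Software stand-in for an Adder IP core shell with six registers."""

    def __init__(self):
        self.regs = [0] * 6
        self.irq_pending = False

    def write(self, reg, value):
        self.regs[reg] = value

    def read(self, reg):
        return self.regs[reg]

    def run(self):
        """Model User_logic: compute the sum, then raise an interrupt if enabled."""
        self.regs[REG_OUT] = self.regs[REG_IN_A] + self.regs[REG_IN_B]
        if self.regs[REG_IRQ_ENABLE] == 1:
            self.irq_pending = True


def hard_adder_sketch(core, a, b, delay_cycles, pid):
    """Walk through the driver steps for one hardware adder task (hypothetical)."""
    # Preparation before data transmission (kernel mode in the real driver).
    core.write(REG_IRQ_ENABLE, 1)
    core.write(REG_PID, pid)
    core.write(REG_DELAY, delay_cycles)
    # Data transmission: the two operands go to Reg 1 and Reg 2.
    core.write(REG_IN_A, a)
    core.write(REG_IN_B, b)
    # The hardware would now run for delay_cycles; the mock runs immediately.
    core.run()
    # Interrupt handler: identify the delegate thread and read the output.
    if not core.irq_pending:
        raise RuntimeError("IP core finished without raising an interrupt")
    core.irq_pending = False
    woken_pid = core.read(REG_PID)
    result = core.read(REG_OUT)
    # Back in user mode, the delegate thread would process the result.
    return woken_pid, result


print(hard_adder_sketch(MockAdderCore(), 3, 4, delay_cycles=2000, pid=1234))
```

The model keeps the PID in Reg 4 so that the interrupt handler can find the delegate thread without any extra bookkeeping, which is the role the text assigns to that register.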
B. CASE SETTING
In our experiment, the two black boxes are initially programmed as the IP cores Adder and Minus. It should be emphasized that we do not care about the actual functions of the IP cores. The execution cycles of the IP core affect the efficiency of the proposed framework, and it is difficult to test all scenarios on a physical DPR platform. Therefore, we use representative case studies as experimental applications to verify the effectiveness of the proposed framework. The two black boxes are numbered 0 and 1: the IP core type of black box 0 is Adder, and that of black box 1 is Minus. The case studies in this paper are shown in Table 2. The execution cycle count of Adder is always set to twice that of Minus. This setting clearly reflects the task scheduling and reconfiguration process and also makes it easy to calculate the theoretical execution time of the task sequences. The execution process is illustrated in Figure 11. In Figure 11(a), there are 10 hardware adder tasks without data dependency, which means all 10 tasks could run in parallel if there were enough Adder IP cores. However, only two black boxes are set up on the experimental platform, and only one of them is programmed as an Adder. To make full use of the FPGA resources, black box 1 is reconfigured to Adder. The proposed framework performs this reconfiguration automatically, and it is transparent to programmers. In Figure 11(b), there are 5 hardware adder tasks and 5 hardware minus tasks without data dependency. The execution time of Adder is twice that of Minus, so the two black boxes run in parallel at the beginning; black box 1 then becomes free and is reconfigured to Adder. In Figure 11(c), there are 10 hardware adder tasks with data dependency. Therefore, the 10 tasks have to run one after another, and only black box 0 is used.
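The behaviour described for Figure 11(a) and (b) can be reproduced with a small simulation. The policy below is an assumed greedy rule, not the scheduling algorithm of the proposed framework: the black box that becomes idle first takes a pending task matching its current function, and is reconfigured only when no such task remains. The function name `simulate_greedy` and the cycle values are hypothetical. Case 3 is omitted because its dependent tasks simply run serially on black box 0.

```python
def simulate_greedy(tasks, t_exec, t_reconfig, initial=("Adder", "Minus")):
    """Greedy schedule of independent hardware tasks on reconfigurable black boxes.

    tasks:      list of function names, e.g. ["Adder"] * 10
    t_exec:     dict mapping function name -> execution cycles
    t_reconfig: cycles needed to reconfigure one black box
    Returns the cycle at which the last task finishes.
    """
    pending = {}
    for name in tasks:
        pending[name] = pending.get(name, 0) + 1
    function = list(initial)         # current function of each black box
    free_at = [0] * len(initial)     # cycle at which each black box is idle
    while any(pending.values()):
        # Pick the black box that becomes idle first (lower index on ties).
        box = min(range(len(function)), key=lambda i: (free_at[i], i))
        if pending.get(function[box], 0) == 0:
            # No matching work left: reconfigure to a function that has work.
            function[box] = next(n for n, count in pending.items() if count > 0)
            free_at[box] += t_reconfig
        pending[function[box]] -= 1
        free_at[box] += t_exec[function[box]]
    return max(free_at)


# Assumed cycle counts; Adder takes twice as long as Minus, as in the text.
cycles = {"Adder": 2000, "Minus": 1000}
print(simulate_greedy(["Adder"] * 10, cycles, t_reconfig=200))
print(simulate_greedy(["Adder"] * 5 + ["Minus"] * 5, cycles, t_reconfig=200))
```

With these assumed values, black box 1 is reconfigured at the start in the first run and only after its five minus tasks in the second run, matching the two scenarios described above.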
C. RESULTS ANALYSIS
The symbols used in the results analysis are summarized in Table 3. The serial execution time is defined as the time required to execute a task sequence serially. The parallel execution time is defined as the minimum amount of time required to execute a task sequence in parallel while allowing reconfiguration. Theoretical time refers to the time required to execute a task sequence without considering the scheduling overhead. Practical time refers to the time required to execute a task sequence when the scheduling overhead is considered, that is, the actually measured time. We use $Eff_i$ to denote the efficiency of the framework presented in this paper for Case i. According to Figure 11 and Table 3, the following equations can be obtained.
For the three cases in the experiment, the theoretical execution times are derived as follows. The serial theoretical execution time of Case 1 corresponds to 10 consecutive adder tasks on black box 0, with no reconfiguration:
$TST_1 = t_{adder} \times 10$
The parallel theoretical execution time of Case 1 corresponds to 5 consecutive adder tasks on each of black boxes 0 and 1, plus one reconfiguration of black box 1:
$TPT_1 = t_{adder} \times 5 + t_{reconfig}$
The serial theoretical execution time of Case 2 corresponds to 5 consecutive adder tasks on black box 0 followed by 5 consecutive minus tasks on black box 1, with no reconfiguration:
$TST_2 = t_{adder} \times 5 + t_{minus} \times 5$
The parallel theoretical execution time of Case 2 follows Figure 11(b). Since $t_{reconfig}$ is unknown, the final program execution time is determined by whichever black box finishes later.
$TPT_2 = \max(t_{adder} \times 4,\; t_{minus} \times 5 + t_{adder} + t_{reconfig})$
Since Case 3 does not involve reconfiguration, its serial theoretical execution time equals its parallel theoretical execution time. All 10 adder tasks are executed on black box 0:
$TST_3 = TPT_3 = t_{adder} \times 10$
Figure 12 shows the speedup of the case studies. The horizontal axis represents the execution cycles of Adder, while the vertical axis represents speedup. The following observations can be made.
1) The theoretical and practical speedups of Case 1 and Case 2 are both greater than 1, while those of Case 3 do not exceed 1. We conclude that the stronger the data dependency between tasks, the lower the speedup. In other words, the higher the potential parallelism between tasks, the higher the speedup the proposed framework can achieve.
2) As $t_{adder}$ decreases, the speedups decrease. We conclude that on a DPR platform, because of the reconfiguration overhead, the longer the IP core execution time, the greater the advantage of the platform.
3) As $t_{adder}$ decreases, the speedup of Case 1 drops the fastest. This is because in Case 1 the reconfiguration lies on the critical path of task execution (that is, the theoretical parallel time includes the $t_{reconfig}$ term), whereas in Case 2 it does not necessarily lie on the critical path. We conclude that a scheduling algorithm can choose the reconfiguration timing so as to hide the reconfiguration overhead.
Figure 13 illustrates the efficiency measured in this experiment. The horizontal axis represents the execution cycles of Adder, while the vertical axis represents efficiency. In fact, we are more concerned about the efficiency of the proposed framework as $t_{reconfig}$ decreases with the development of reconfigurable technology.
The main factors affecting efficiency include the reconfiguration overhead, the scheduling overhead, and the data transmission time. The smaller the proportion p of the overhead in the task execution time, the higher the efficiency of the framework. We approximate p by $t_{reconfig}/t_{adder}$. In our experiment, $t_{reconfig}$ on the physical platform is basically fixed, so we vary $t_{adder}$ in order to vary p. When $t_{adder}$ is 2000 cycles, which is an order of magnitude higher than $t_{reconfig}$, the efficiencies of all cases are over 98%. This shows that the framework proposed in this paper is effective.
The experimental results show that when the reconfiguration overhead and the task execution time are of the same order of magnitude, the scheduling efficiency exceeds 90%, and when the reconfiguration overhead is one order of magnitude lower than the task execution time, the scheduling efficiency exceeds 98%. Therefore, the framework is flexible and efficient.
It is plausible to expect that, as FPGA reconfiguration technology develops, the reconfiguration bandwidth will increase and the reconfiguration overhead will shrink, which will further improve the scheduling efficiency.
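As a worked illustration of the theoretical execution times in this section, the sketch below evaluates them for several values of the Adder execution time. It rests on two assumptions that are not stated in the text: the theoretical speedup is taken to be the ratio of serial to parallel theoretical time, and the reconfiguration time is set to a hypothetical 200 cycles (chosen only because 2000 cycles is described as an order of magnitude higher). The function names are illustrative.

```python
def theoretical_times(t_adder, t_reconfig):
    """Return {case: (serial time, parallel time)} from the Case 1-3 formulas."""
    t_minus = t_adder / 2  # Adder is set to twice the execution time of Minus
    return {
        1: (t_adder * 10, t_adder * 5 + t_reconfig),
        2: (t_adder * 5 + t_minus * 5,
            max(t_adder * 4, t_minus * 5 + t_adder + t_reconfig)),
        3: (t_adder * 10, t_adder * 10),
    }


def theoretical_speedups(t_adder, t_reconfig):
    """Assumed definition: speedup = serial theoretical / parallel theoretical."""
    times = theoretical_times(t_adder, t_reconfig)
    return {case: serial / parallel for case, (serial, parallel) in times.items()}


T_RECONFIG = 200  # hypothetical reconfiguration time in cycles

for t_adder in (2000, 1000, 400, 200):
    p = T_RECONFIG / t_adder  # overhead proportion used in the text
    speedups = theoretical_speedups(t_adder, T_RECONFIG)
    row = ", ".join(f"Case {c}: {s:.3f}" for c, s in speedups.items())
    print(f"t_adder={t_adder:5d}  p={p:.2f}  {row}")
```

Under these assumptions, the Case 2 speedup stays at 1.875 as long as the Adder execution time is at least twice the reconfiguration time, because the reconfiguration of black box 1 then finishes off the critical path, whereas the Case 1 speedup falls as soon as the Adder execution time shrinks. This is consistent with observation 3) above.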
V. RELATED WORK
The rapid development of edge computing places ever higher requirements on the data storage and computing capacity of edge computing systems. Much research has focused on data storage [17]-[20]. Meanwhile, computing power is also a bottleneck of system performance. Heterogeneous computing systems composed of CPUs and FPGAs have been widely used in edge computing scenarios. Yan et al. [1] propose a 5G satellite edge computing framework. To increase the flexibility of the framework in complex scenarios, they unify the resource management of CPUs, GPUs and FPGAs. Their experiments verify that the proposed framework has lower delay and better flexibility than the 5G network. Du et al. [2] present an FPGA-based acceleration of a game theory algorithm in edge computing for autonomous driving, and report a speedup of 2.4 times over a CPU implementation. In more application scenarios, edge computing not only requires higher quality of service support [21]-[25], but also places higher requirements on the computing power of the processor. Therefore, the CPU+FPGA heterogeneous architecture is especially important for edge computing.
The most important advantage of FPGAs is dynamic reconfiguration. Owing to the low power consumption, low latency, and reconfigurability of FPGAs, a large number of solutions have been proposed that use FPGAs as accelerators in multi-core heterogeneous platforms. These solutions can be classified as static or dynamic according to the reconfiguration method. The static approach can only perform reconfiguration at the initialization phase, while the dynamic one can do so at runtime. In the static approach, the maximum number of tasks mapped to the FPGA is limited by its physical size, which restricts performance and application scenarios. DPR, as a dynamic approach, provides flexible configuration of functional units and can thus better leverage the computing ability of FPGAs. However, DPR introduces a runtime reconfiguration overhead that is about three orders of magnitude higher than the multi-task context switch time. This overhead is still not negligible at present and can seriously affect the performance of the system.
To address the problem mentioned above, various studies have been conducted on DPR platforms. Fast reconfiguration interfaces are proposed in [4] and [5], reaching throughputs of 400 MB/s and 800 MB/s, respectively, which reduces the reconfiguration time significantly. Iturbe et al. [26] presented R3TOS, which uses reconfigurable interfaces to better support dynamic task allocation. Beckhoff et al. [27] proposed a new partial reconfiguration framework whose flexible construction can integrate reconfigurable modules into the system efficiently. Pellizzoni and Caccamo [28] proposed an optimization method that divides the reconfigurable area of the FPGA platform into slots. Neither the size nor the location of the slots can change during execution. Hardware tasks occupy one or more slots when executed, which solves the problem that static task assignments can only be obtained offline. The slotless method, by contrast, introduces complex reconfiguration algorithms on the physical platform. Since this paper focuses only on task scheduling on the DPR platform, the slot method is used. Saha et al. [29] presented a new scheduling algorithm for preemptible hardware tasks that takes advantage of the higher speed of modern reconfiguration interfaces. It dynamically reallocates the tasks each time they terminate. However, the algorithm assumes a uniform partition strategy and a fixed reconfiguration time, which results in considerable waste of FPGA hardware area. Dittmann and Frank [30] addressed the problem of scheduling reconfiguration requests by treating it as a single-core scheduling problem.
However, the work listed above mainly improves the performance of DPR from the perspective of physical design. In a system that combines DPR-capable FPGAs with GPPs, the OS is the manager and dispatcher of system resources. Effectively utilizing the features of DPR and uniformly scheduling the hardware and software tasks of the system has become the key to reducing the reconfiguration overhead and improving system performance. We therefore explore the potential performance advantages of DPR from the perspective of OS scheduling in this paper.
Some previous studies have proposed task abstraction and scheduling models for FPGA-based heterogeneous platforms at the OS level. OS4RS [31] proposed an OS framework for embedded reconfigurable platforms with fixed tasks. Task graphs were used to manage the software and hardware tasks running in the system, and the communication between hardware tasks is determined by the edges of the task graphs. However, this method only suits systems with fixed tasks and is not applicable to general systems. Andrews et al. [6] proposed Hthread, a unified multithreading model for reconfigurable systems, which defines threads running on GPPs as software threads and threads running on reconfigurable resources as hardware threads. Data synchronization and communication between software threads and hardware threads can be achieved through shared memory and other methods. The concept of hardware threads has been widely adopted in the past decades. In BORPH [10], the OS kernel is divided into two parts, MK and UK. MK provides the mainstream OS functions, and UK is responsible for reconfigurable logic management. This separation simplifies the design and makes it easy to manage the reconfiguration and scheduling processes separately. However, how to fully utilize the performance of a DPR system is not detailed. ReconOS [8] extended the multi-threaded programming model of the software domain to the hybrid CPU/FPGA platform. This model enables users to write multi-threaded programs that mix software and hardware threads in the same way as they write software-only multi-threaded programs. The threads are transparent to each other; that is, a thread does not care whether the thread it interacts with is a hardware or a software thread. However, the cycle accuracy of hardware tasks and the task preemption overhead are not fully considered, which results in performance loss in DPR systems.
FOSFOR [9] introduces the design of distributed OSs into the reconfigurable system, and abstracts the management of reconfigurable resources into a series of OS kernels. This distributed model has both pros and cons: it can increase system portability, but communication efficiency becomes a bottleneck of system performance.
However, most of the OS-level studies mentioned above assume a static reconfiguration strategy. They do not take the cycle accuracy of hardware tasks and the preemption overhead into full consideration, and thus offer insufficient support for DPR scenarios. In this paper, we combine the existing research ideas with the characteristics of hardware tasks, and implement a scheduling framework for DPR systems based on the unified software and hardware thread model. It helps to fully exploit the performance of DPR systems and improve scheduling efficiency. The proposed method is widely applicable and scalable, and has been verified on existing platforms.
VI. CONCLUSION
Currently, general-purpose processors alone can no longer meet the computing demands of massive data, so heterogeneous computing platforms have been proposed and widely used in edge computing. The CPU-FPGA heterogeneous architecture achieves both the high performance of hardware acceleration and the flexibility of software. However, the existing work does not fully consider the differences between hardware tasks and software tasks, resulting in low scheduling efficiency. To address this problem, this paper proposes a task scheduling framework on DPR platforms.
We first introduce the hardware architecture and the software stack. The hardware architecture packages FPGA resources as black boxes, which can be reconfigured at runtime, and the software stack consists of a unified software and hardware threading model, a task scheduling method in the OS, and a service-oriented programming model. Then, in order to fully exploit the potential computing power of DPR systems and improve the scheduling efficiency, we propose a task scheduling framework. The framework takes full account of the preemption overhead, the reconfiguration time and the cycle accuracy of hardware tasks, and thus it is highly applicable and scalable. The hardware thread participates in OS scheduling through its associated delegate thread, and the optimized task scheduling model reduces both the number and the overhead of task context switches. In addition, experimental results on the ZedBoard platform show that the proposed framework works well. When the execution time of the accelerator exceeds the reconfiguration cost by an order of magnitude, the efficiency in all cases is greater than 98%, and the efficiency reaches 90%-98% when the two are of the same order of magnitude.
In the future, we plan to extend our framework to more scenarios in which FPGAs are used as part of the GPPs or as a coprocessor, whereas at present FPGAs are used only as separate external processing units. We also plan to further study how the data path of the GPPs and the cache affect the proposed framework.
