Despite the many commercial and open-source frameworks that ease the deployment of FPGAs for accelerating applications, current schemes fail to support sharing multiple accelerators among various applications. An accelerator sharing scheme must support three main features: exploiting the dynamic parallelism of multiple accelerators, sharing accelerators among multiple applications, and providing a non-blocking, congestion-free environment in which multiple applications can call multiple accelerators. In this paper, we develop a scalable, fully functional hardware controller, called UltraShare, with a supporting software stack that provides a dynamic accelerator sharing scheme through an accelerator grouping mechanism. UltraShare allows software applications to fully utilize FPGA accelerators in a non-blocking, congestion-free environment. Our experimental results for a simple scenario that invokes a combination of three streaming accelerators show an improvement of up to 8x in accelerator throughput by removing accelerator idle times.
I. INTRODUCTION
In the era of big data and compute- and data-intensive applications, researchers have proposed different solutions to meet the performance demands of these applications [1]. Among the proposed techniques, deploying heterogeneous architectures is a very promising solution that can be combined with other solutions to tackle high computational loads. FPGAs assist CPUs as custom hardware accelerators through their fine-grained programmable hardware resources. FPGAs consume considerably little power, and researchers have shown an order of magnitude acceleration for various applications [2]-[5].
Streaming applications, such as image/video processing, real-time vision algorithms, and network packet encryption, belong to the category of FPGA-friendly data-intensive applications [6]. However, the multi-level memory hierarchy model used in OpenCL-based platforms fails to meet the high data throughput demands of streaming applications. Ruan et al. [7] have shown that a point-to-point data transfer from main memory to FPGA BRAMs can significantly improve the performance of streaming accelerators compared to an OpenCL hierarchical memory model. When multiple accelerators on an FPGA are deployed to accelerate various streaming applications, shared hardware resources impose more stringent constraints on high-throughput data movement between the FPGA and the main memory. While a point-to-point data transfer between the FPGA local memory and the main memory is necessary, a scalable and efficient high-throughput data movement infrastructure between the host and the FPGA accelerators is also required. Accelerator allocation is a crucial task when multiple accelerators on an FPGA device are shared among various applications. To the best of our knowledge, all the current FPGA accelerator frameworks [7]-[12] follow a static accelerator allocation scheme, i.e., software developers have to specify the exact target accelerator for every access request in the software code. This can lead to poor utilization of accelerators that are shared among various applications. In this paper, we propose UltraShare, an open-source RTL-level framework that enables scalable and efficient FPGA-based accelerator sharing. Unlike the currently available frameworks, UltraShare allocates accelerators to requests dynamically. UltraShare also reduces the idle time of the accelerators by deploying an accelerator grouping mechanism, which results in a considerable improvement in the performance of FPGA accelerators.
In addition, the data size and throughput can vary from one group of accelerators to another. UltraShare provides fair or priority-based data transfer to and from accelerators. We briefly summarize the contributions of this paper as follows: 1) For the first time, we introduce a non-blocking FPGA-based accelerator sharing framework, called UltraShare, by proposing a hardware controller that enables dynamic sharing. 2) We propose an accelerator grouping architecture that enables efficient access to accelerators shared among multiple streaming applications. 3) We developed UltraShare in the Verilog hardware description language, which makes it compatible with all FPGA vendors and RTL synthesis tools. UltraShare is an open-source framework that other research groups can use and contribute to. 4) We evaluated UltraShare with standard IP cores that use the standard AXI-Stream interface.
II. BACKGROUND AND RELATED WORK
In data centers and edge computing platforms, different applications use the accelerators available on FPGAs to accelerate their computation-intensive kernels. The accelerators must be accessed through an underlying infrastructure that interfaces the host with the FPGA accelerators. There are currently a few academic [7], [10]-[13] and industrial [8], [9] frameworks designed to connect a host to FPGA accelerators. However, they either do not support or fail to provide a seamless interface to multiple accelerators accessed simultaneously by various applications.
Efficient multi-accelerator management is needed, first to let host applications access the accelerators and then to minimize access overheads and accelerator idle times. With static accelerator allocation [7]-[12], concurrent access to accelerators cannot be exploited to its full potential, since different applications are not aware of the status of all the accelerators on the FPGAs.
Software-level blocking mechanisms, such as semaphores, allow some of the currently available frameworks to support access to accelerators from multiple threads of one application [7], [10]-[12], but these frameworks fail to support accelerator sharing among multiple applications. OpenCL-based frameworks such as Xilinx® SDAccel [8] and Intel SDK [9] support sharing accelerators among multiple applications; however, these platforms do not support non-blocking, congestion-free accelerator invocation and do not allow concurrent requests to be issued to the same accelerators.
In most of the available frameworks, a finite-state machine in the software stack keeps the host interacting with the accelerators and does not allow a new request to be issued for the same accelerator. To provide congestion-free, non-blocking accelerator sharing, MQMAI [13] proposes a single-command-based mechanism for FPGA-based accelerators. This mechanism eliminates all interaction between the host and the accelerators after a request is issued. However, MQMAI does not include a hardware controller to manage accelerator sharing efficiently.
III. ULTRASHARE: MULTI-ACCELERATOR FRAMEWORK
One of the most important requirements of a multi-accelerator system is the capability of sharing accelerators among different host applications. An efficient accelerator sharing mechanism has the following features: 1) It exploits dynamic parallelism: all the requests from one application are distributed among the available accelerators. 2) It shares accelerators among multiple applications: multiple applications can share a single accelerator. 3) It provides non-blocking, congestion-free access to accelerators.
In this paper, we propose an adaptable, scalable hardware controller, called UltraShare, that addresses the three features of accelerator sharing mentioned above and provides efficient data movement between a host and FPGAs. UltraShare is composed of five main parts (Figure 1): 1) multi-queue accelerator request, 2) dynamic accelerator allocation, 3) scatter-gather, 4) accelerators controller, and 5) data transfer. The inputs to the hardware controller are streams of commands from the host main memory. In the following, we explain the components of each part in detail.
A. Multiple Command Queues
Accelerator allocation is responsible for assigning commands to accelerators dynamically. In a single-queue non-grouping mechanism, the command at the head of the queue is always processed first. If no accelerator is available for this command, it blocks the rest of the commands from being processed. Thus, the single-queue mechanism may result in severe accelerator underutilization due to blocking among multiple applications that request accelerators of different types. To tackle this, UltraShare bases accelerator allocation on an accelerator grouping mechanism. The accelerator allocation includes the following components.
Command Detector: A command includes all the information required to process the associated request without any interaction with the host. When a command arrives, the command detector pushes it into one of the command queues based on the requested accelerator type. In this paper, we consider only a single-level accelerator grouping based on accelerator types. However, the UltraShare framework allows more sophisticated strategies, e.g., a two-level priority-based grouping in which the first level is based on the priority of the accelerators and the second level is based on the accelerator types. In this way, some of the accelerators can be reserved for high-priority requests.
Command Queues: A command queue is a simple FIFO implemented with BRAMs. There is one dedicated command queue for each group of accelerators.
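The difference between the two queueing schemes can be illustrated with a small software model. The following Python sketch is not part of UltraShare, which is implemented in Verilog; the function names, the group names ("aes", "rgb_small", "rgb_large"), and the one-shot dispatch loop are assumptions made only for illustration. It shows how a command at the head of a single shared FIFO can block commands for other groups that still have free accelerators.

```python
from collections import deque

def single_queue_dispatch(commands, free):
    """One shared FIFO: stop at the first command whose group has no free accelerator."""
    queue = deque(commands)
    issued = []
    while queue and free.get(queue[0], 0) > 0:
        group = queue.popleft()
        free[group] -= 1
        issued.append(group)
    return issued

def multi_queue_dispatch(commands, free):
    """Per-group FIFOs: the detector sorts commands by accelerator type first."""
    queues = {}
    for group in commands:
        queues.setdefault(group, deque()).append(group)
    issued = []
    for group, q in queues.items():
        # A full queue for one group never delays another group.
        while q and free.get(group, 0) > 0:
            q.popleft()
            free[group] -= 1
            issued.append(group)
    return issued

# Hypothetical workload: two AES commands arrive before the RGB commands.
commands = ["aes", "aes", "rgb_small", "rgb_large"]
free_now = {"aes": 1, "rgb_small": 1, "rgb_large": 1}

single = single_queue_dispatch(commands, dict(free_now))
multi = multi_queue_dispatch(commands, dict(free_now))
print(single)  # ['aes']
print(multi)   # ['aes', 'rgb_small', 'rgb_large']
```

In this toy workload the second AES command waits for a busy accelerator. With one shared queue it also holds back both RGB commands, whereas with per-group queues the two idle RGB accelerators are used immediately.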
B. Dynamic Accelerator Allocation
Accelerator Allocation Unit (AAU): The AAU is the main unit of the dynamic accelerator allocation part. This unit assigns an accelerator to the command at the head of a command queue. The AAU visits the queues in round-robin order. If no accelerator is available for the selected command, the AAU moves on to the next command queue. The inputs to the AAU are: 1) the status of all accelerators, and 2) the output of the accelerator group table, which represents the mapping of accelerator numbers to accelerator groups.
Accelerator Group Table: The accelerator group table is a lookup table that maps accelerators to accelerator groups. This lookup table is reconfigurable through software APIs: a user can regroup accelerators, remove an accelerator from a group, or add an accelerator to a group.
Command Requester: After an accelerator is allocated to a command, the command requester submits a request to the DMA to fetch the input data (RX) and output data (TX) scatter-gather lists.
Request Information Queue: The information related to each processed command will be stored in a queue to be used when the associated scatter-gather lists arrive.
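The allocation policy described above can be sketched as a small behavioral model. The Python code below is an illustration under stated assumptions, not the Verilog implementation: the class name AllocationUnit, its methods, and the choice of the lowest-numbered free accelerator within a group are all hypothetical. It models the two AAU inputs (accelerator status and the group table) and the round-robin visit of the command queues.

```python
from collections import deque

class AllocationUnit:
    """Behavioral model of round-robin allocation over per-group command queues."""

    def __init__(self, group_table, num_groups):
        # group_table maps accelerator number -> group number (assumed layout).
        self.group_table = dict(group_table)
        self.queues = [deque() for _ in range(num_groups)]
        self.busy = {acc: False for acc in group_table}
        self.next_queue = 0  # round-robin pointer

    def push(self, group, command):
        self.queues[group].append(command)

    def _free_accelerator(self, group):
        # Assumption: pick the lowest-numbered idle accelerator of the group.
        for acc in sorted(self.group_table):
            if self.group_table[acc] == group and not self.busy[acc]:
                return acc
        return None

    def step(self):
        """Allocate at most one command; skip queues that cannot be served."""
        n = len(self.queues)
        for offset in range(n):
            g = (self.next_queue + offset) % n
            if not self.queues[g]:
                continue
            acc = self._free_accelerator(g)
            if acc is None:
                continue  # no accelerator available: try the next queue
            command = self.queues[g].popleft()
            self.busy[acc] = True
            self.next_queue = (g + 1) % n
            return command, acc
        return None

    def release(self, acc):
        self.busy[acc] = False

# Accelerators 0 and 1 form group 0; accelerator 2 forms group 1.
aau = AllocationUnit({0: 0, 1: 0, 2: 1}, num_groups=2)
for cmd in ("a0", "a1", "a2"):
    aau.push(0, cmd)
aau.push(1, "b0")

trace = [aau.step() for _ in range(4)]
aau.release(0)
after_release = aau.step()
print(trace)          # [('a0', 0), ('b0', 2), ('a1', 1), None]
print(after_release)  # ('a2', 0)
```

The fourth step returns None because both accelerators of group 0 are busy and the queue of group 1 is empty; once accelerator 0 is released, the pending command is allocated to it.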
C. Scatter-Gather (SG)
The SG part is responsible for receiving and decoding SG lists (SGLs) and distributing them to their associated accelerators. The SG part is composed of the following components.
Fig. 2. Accelerator controller and interleaved RX/TX SG manager.
SG Decoder: An SGL consists of a list of memory page addresses with their associated data lengths. Usually, the length of the first and last SG is less than the memory page size, while the length of every other SG is equal to the page size. To shorten the SGL, we compact it by omitting the lengths of the middle SGs. When an SGL arrives, the SG decoder extracts pairs of addresses and lengths from it. We call each address-length pair an SG element.
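The compaction idea can be made concrete with a short sketch. The Python code below assumes a 4 KiB host page and a hypothetical compact layout of (address list, first length, last length); the actual on-wire SGL format of UltraShare is not specified here, so the layout and the function names are illustrative only.

```python
PAGE_SIZE = 4096  # assumed host memory page size in bytes

def compact_sgl(elements, page_size=PAGE_SIZE):
    """Drop the lengths of the middle SG elements, which all span a full page."""
    if not elements:
        raise ValueError("an SGL needs at least one element")
    for _, length in elements[1:-1]:
        if length != page_size:
            raise ValueError("middle SG elements must span a full page")
    addresses = [addr for addr, _ in elements]
    return addresses, elements[0][1], elements[-1][1]

def decode_sgl(addresses, first_len, last_len, page_size=PAGE_SIZE):
    """Rebuild (address, length) SG elements from the compact form."""
    last = len(addresses) - 1
    elements = []
    for i, addr in enumerate(addresses):
        if i == 0:
            length = first_len      # first element may start mid-page
        elif i == last:
            length = last_len       # last element may end mid-page
        else:
            length = page_size      # implied, not stored
        elements.append((addr, length))
    return elements

# A buffer that starts 256 bytes before a page boundary and ends 100 bytes into a page.
sgl = [(0x1F00, 256), (0x5000, 4096), (0x9000, 4096), (0xC000, 100)]
compact = compact_sgl(sgl)
restored = decode_sgl(*compact)
print(restored == sgl)  # True
```

For an SGL with n elements, this layout stores n addresses and two lengths instead of n addresses and n lengths.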
SG Distributor: The inputs to the SG distributor are SG elements and the information from the request information queue.
D. Accelerators Controller
Acc Controller: An accelerator controller has one RX data buffer for each input and one TX data buffer for each output. Figure 2 shows a detailed block diagram of the accelerator controller and data transfer parts. Conventionally, RX/TX buffers must be large enough to store all the data for one accelerator request. The buffers could be made smaller by taking into account the rate of input/output data over PCIe and the processing rate of the accelerator. However, this optimization requires careful profiling of the accelerator processing rate and the data transfer rate. Moreover, for most accelerators with a low processing rate, the buffers would still be relatively large. Allocating large BRAM buffers to the accelerator invocation framework does not leave designers enough space to place more accelerators on an FPGA.
To overcome this problem, we define one RX SG queue for each input and one TX SG queue for each output. Thus, each accelerator controller stores the whole list of SG elements. An accelerator then issues one RX request whenever enough space is available in the RX data buffer, and one TX request whenever enough data is available in the TX data buffer. This mechanism allows the RX/TX data buffers to be much smaller. A buffer must be at least as large as one host memory page, which is the maximum length of one SG element. To prevent an accelerator from stalling while it waits for RX data or a free TX buffer, we size the buffers at a few memory pages.
Besides resolving the problem of large buffer sizes, our mechanism for handling SGLs makes it possible to apply a scheduling strategy when serving different accelerators, since SG elements from different accelerators can be served in any order. We provide a scheduler in the data transfer part (Section III-E) based on a configurable priority list.
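The interplay between small RX buffers and the priority list can be sketched in software. The following Python model is only an illustration: the names RxChannel and schedule_rx, the strict-priority policy, and the byte-level bookkeeping are assumptions, and the hardware scheduler may behave differently. An RX request for the SG element at the head of a queue is issued only if the element fits into the free space of that accelerator's RX buffer.

```python
from collections import deque

class RxChannel:
    """RX side of one accelerator input: an SG queue plus a small data buffer."""

    def __init__(self, buffer_bytes):
        self.sg_queue = deque()         # pending (address, length) SG elements
        self.free_bytes = buffer_bytes  # free space in the RX data buffer

    def can_issue(self):
        return bool(self.sg_queue) and self.sg_queue[0][1] <= self.free_bytes

    def issue(self):
        addr, length = self.sg_queue.popleft()
        self.free_bytes -= length       # reserve space for the incoming data
        return addr, length

    def consume(self, length):
        self.free_bytes += length       # the accelerator drained this much data

def schedule_rx(channels, priority):
    """Serve the first accelerator in the priority list that can issue a request."""
    for acc in priority:
        if channels[acc].can_issue():
            return acc, channels[acc].issue()
    return None

PAGE = 4096  # assumed page size, the maximum length of one SG element
channels = {0: RxChannel(2 * PAGE), 1: RxChannel(PAGE)}
channels[0].sg_queue.extend([(0x1000, PAGE), (0x2000, PAGE), (0x3000, PAGE)])
channels[1].sg_queue.extend([(0x8000, PAGE), (0x9000, PAGE)])

priority = [1, 0]  # hypothetical configuration: accelerator 1 is served first
served = [schedule_rx(channels, priority) for _ in range(4)]
channels[1].consume(PAGE)
resumed = schedule_rx(channels, priority)
print(served)
print(resumed)
```

In this run accelerator 1 is served first, then accelerator 0 receives two requests while the buffer of accelerator 1 is full, and the fourth attempt issues nothing because no buffer has room for another page. After accelerator 1 consumes one page, its next SG element is requested.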
E. Data Transfer
The data transfer part is responsible for providing RX/TX data to the accelerators. Together with the accelerator controller part, it fetches RX data from and submits TX data to the DMA engine for each SG element. Notably, the RX and TX data paths are completely separate, so an RX request and a TX request can be served, and their data moved, at the same time. The components of the data transfer part are introduced below.
RX/TX SG Requester: These components submit a request for one SG element to the DMA. The request includes an address and a length.
Data Request Information: For each RX request, a data request information queue stores the information related to the request, to be used when the corresponding data arrives. This information allows the data distributor component to deliver the data to the correct accelerator.
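The role of the data request information queue can be illustrated with a minimal model. The Python sketch below assumes that the DMA returns RX data in the same order in which the requests were submitted, which is an assumption of this example and is not stated above; the class and method names are hypothetical.

```python
from collections import deque

class DataDistributor:
    """Routes returning RX data to accelerators using a FIFO of request records."""

    def __init__(self):
        self.request_info = deque()  # (accelerator number, expected length)
        self.rx_buffers = {}         # accelerator number -> received payloads

    def record_request(self, acc, length):
        # Called when the RX SG requester submits a request to the DMA.
        self.request_info.append((acc, length))

    def on_data(self, payload):
        # Assumption: data arrives in request order, so the oldest record matches.
        acc, length = self.request_info.popleft()
        if len(payload) != length:
            raise ValueError("payload length does not match the recorded request")
        self.rx_buffers.setdefault(acc, []).append(payload)
        return acc

dist = DataDistributor()
dist.record_request(acc=2, length=4)
dist.record_request(acc=0, length=2)
first = dist.on_data(b"\x01\x02\x03\x04")
second = dist.on_data(b"\xaa\xbb")
print(first, second)  # 2 0
```

Because the record carries the accelerator number, the data itself does not need to be tagged with its destination.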
IV. EXPERIMENTAL RESULTS

A. Experimental Setup
To synthesize and implement UltraShare, we use the Xilinx® Vivado® 2018.2 design tool. We use a 7v3-alpha-data board, which has a Xilinx® Virtex 7 XC7VX690T FPGA with a PCIe Gen3 connector. Our host is a PC with an Intel® Core™ i5-4590 CPU @ 3.30GHz.
We have implemented UltraShare in pure Verilog. Thus, UltraShare is not limited to any specific vendor tool or platform. We have deployed a command-based data interface scheme in our software stack, similar to [13], that allows multi-core parallel access to accelerators. UltraShare can be used with other available command-based software platforms such as Xilinx SDAccel. It only requires the software stack to submit single commands that follow the UltraShare command structure, including an accelerator type field for managing accelerators.
In our applications, we use two APIs, one for calling accelerators and one for waiting for their completion. We measure the throughput of the accelerators by measuring the end-to-end delay of processing requests. This delay is measured in the software application from the point at which the accelerator is called until the completion is received.
B. Benchmarks
To evaluate UltraShare, we use two streaming accelerators: an image/video processing accelerator and a network packet encryption accelerator.
Image processing: We use RGB-to-YCbCr, a standard streaming image processing IP core from Xilinx® with the standard AXI4-Stream interface. This accelerator can be configured at design time for different image/video resolutions. We define three types of RGB-to-YCbCr converter, for resolutions 240 × 180, 480 × 320, and 960 × 640. Although the computation algorithm is the same for these accelerators, their input/output sizes and their computation latency for a single input differ.
Network packet encryption: Encryption is a streaming, computation-intensive algorithm and is a good candidate for acceleration on FPGAs. We use an AES-128 encryption core from MVD® cores and run it over videos with different resolutions. Unlike the RGB-to-YCbCr accelerator, the same AES accelerator can operate on different input sizes.
C. Results
1) Dynamic accelerator allocation:
To explore the impact of dynamic versus static accelerator allocation, we compare UltraShare with Riffa [10]. Riffa is the only open-source framework that is available for use. While ST-Accel, a recently proposed framework, automates the process of generating accelerators and connecting them to applications, its accelerator allocation and data transfer mechanism is very similar to that of Riffa. Riffa cannot handle multiple requests from different applications to a single accelerator. Thus, to compare with Riffa, we use one multi-threaded application with a semaphore mechanism to manage requests to the same accelerators. The chosen accelerator is an RGB-to-YCbCr 480×360. For Riffa, we experiment with different static accelerator allocation scenarios, ranging from the worst case, in which all three threads request only one of the accelerators, to the best case, in which each thread requests a different accelerator. Compared to the worst case of static accelerator allocation in Riffa, we observe more than 3x improvement in throughput. Note that this is just a simple scenario to show the impact of dynamic accelerator allocation. In a more complicated scenario, static accelerator allocation can drastically degrade performance.
2) Multi-queue accelerator grouping: To show the impact of multi-queue accelerator grouping on removing accelerator idle times, we implemented three types of accelerators: two RGB-to-YCbCr converters, for resolutions 240 × 180 and 480 × 320, and one AES accelerator to which we submitted video frames with a resolution of 240 × 180. We implemented 3 instances of each accelerator type, for a total of 9 accelerators on our FPGA. Table I compares the throughput of UltraShare with that of a non-grouping single-queue implementation. As described in Section III-A, the multi-queue mechanism proposed by UltraShare decreases accelerator idle times and allows an accelerator to process a request whenever at least one request is available for it. In this experiment, we used three different applications, each requesting one of the accelerator types. In a single-queue non-grouping implementation, requests for the slowest accelerators block other accelerators from being assigned to the available requests. Thus, all the accelerators are limited to the throughput of the slowest accelerator. Note that in our experiment, the RGB-to-YCbCr 240 × 180 accelerator has only a slightly higher throughput. The reason is that, due to its smaller input size, the associated user application can prepare and submit more requests than the other applications. Thus, more requests from this application end up in the shared command queue. As Table I shows, UltraShare with uniform priority weights provides around 8x improvement in the throughput of the fastest accelerator type by removing the idle times caused by the slower accelerators.
3) Resource utilization: To show the scalability of UltraShare, we measured the resource utilization for various numbers of accelerators and accelerator groups. Among all the resources, only the utilization and variation of LUTs and BRAMs are significant. Increasing the number of accelerators and accelerator groups has a linear impact on the number of both LUTs and BRAMs. For one group of accelerators the number of LUTs increases from 700
To show the accelerator sharing feature of UltraShare among different applications, we implemented three instances of the AES encryption accelerator on the FPGA. We then provided three different applications: application 1 submits requests for a video with a resolution of 240 × 180, application 2 submits requests for a video with a resolution of 480×360, and application 3 submits requests for a video with a resolution of 960×640. We considered three different scenarios (Figure 3). In scenario a, we ran only one of the applications at a time and measured the throughput of the accelerators for each application. In scenario b, we ran two different applications simultaneously and considered all three possible combinations of two applications out of three. In scenario c, we ran all the applications simultaneously. The throughput of the processed frames for each application in these three scenarios, presented in Figure 3, shows how evenly the accelerators are shared among the applications. Although the number of processed frames differs between applications because of their different input sizes (a larger input takes longer to process), Figure 4 (scenario c) shows that the accelerator usage of all three applications is equal. In other words, the difference in throughput is due to the different request sizes, which require different computation latencies.
V. CONCLUSIONS
In this paper, we proposed UltraShare, an FPGA-based accelerator hardware controller that enables dynamic accelerator sharing among multiple host applications. UltraShare provides a scalable dynamic accelerator allocation scheme to exploit dynamic parallelism for the requests from a single application.
Fig. 4. Normalized AES accelerators usage by three different applications submitting three different video resolutions.
Using an accelerator grouping scheme, UltraShare removes accelerator idle times to improve the total throughput of a multi-accelerator system. UltraShare also deploys a single-command-based request mechanism that provides a non-blocking environment in which different host applications share FPGA accelerators. Experimental results show that in a simple scenario with 9 accelerators of 3 different types, UltraShare provides up to 8x throughput improvement for streaming applications compared to a single-queue non-grouping implementation.
