Abstract: FPGA-based accelerators are becoming first-class citizens in datacenters. Adding FPGAs to datacenters can lead to higher compute densities with improved energy efficiency for latency-critical workloads, such as financial applications. However, FPGA deployment in datacenters brings difficulties to both application developers and cloud providers. Application writers need to deal with the interfacing of FPGAs on top of application logic and algorithms. Cloud providers, on the other hand, are reluctant to face the risk that their hardware remains underutilized due to the lack of a sharing mechanism for FPGAs. In this paper, we introduce VineTalk, a framework that reduces the programming effort associated with FPGA-based accelerators and FPGA virtualization. We integrate VineTalk with the Xilinx SDAccel development framework and we map it to the Kintex UltraScale FPGA. Our preliminary evaluation with a use case of financial applications shows that VineTalk can offer effective FPGA sharing while introducing less than 4% overhead to application execution time.
I. INTRODUCTION
The cost-effectiveness of commodity hardware has led to horizontal scaling of modern datacenter applications: whenever an application challenges the limits of a single server, one can simply get more performance by using more servers in parallel. As a result, today most large-scale datacenter applications are hosted by scale-out deployments of commodity servers, either in private datacenters or in public clouds. However, recent trends, both in application requirements and technology, challenge the sustainability of this scale-out approach. First, applications are becoming more computationally intensive. There is an increasing use of compute-intensive workloads, such as machine learning and deep learning algorithms where matrix multiplications dominate [1]. Second, with the current projections for unprecedented data growth, applications will need to continue on the scale-out path. As a result, CPU processing is becoming a main limitation in datacenters [2].
Recently, a response to these trends has been the use of application-specific accelerators in datacenters. The use of such acceleration units can increase performance and efficiency of datacenters significantly, while maintaining power and hardware costs under control. In the last couple of years there have been several reconfigurable architectures proposed for the acceleration of cloud computing applications in datacenters. However, there are three main challenges that limit the use of FPGA resources in the cloud.
0 Institute of Computer Science (ICS) - Foundation for Research and Technology Hellas (FORTH)
1 Institute of Communication and Computer Systems (ICCS) - National Technical University of Athens (NTUA)
The first challenge is the high programming effort. To address this issue, FPGA vendors have already modernized hardware design procedures by replacing traditional HDLs, such as VHDL and Verilog, with higher-level languages, such as OpenCL and C++. Despite this progress, however, hardware design remains a challenge for software engineers. The second main challenge is the interfacing of FPGA resources to software applications. For each application, there is typically a complex and platform-specific communication layer that needs to be considered carefully to transfer data to the FPGA, allocate and release memory, specify which routine to run, and get the results back. The third challenge is the lack of virtualization mechanisms in FPGAs, which would allow operators to achieve better utilization of their hardware. Virtualization of FPGA resources itself presents two main difficulties. First, it requires dynamic reconfiguration, which is hard to achieve transparently. Second, even when applications use the same configuration on an FPGA, it is a challenge to coordinate shared access to the FPGA.
In this paper we present VineTalk, a software layer between FPGAs and applications that reduces the complexity of communication between application software and accelerator hardware, and allows sharing of FPGAs across applications. Furthermore, VineTalk allows applications to transparently access FPGA accelerators, regardless of whether they run natively on a server, within virtual machines, or in containers. In this work we focus on FPGA sharing among different applications that use the same FPGA configuration, and we leave as future work the complete virtualization of FPGAs with dynamic reconfiguration on top of our device sharing infrastructure. Our system consists of two major components: a Communication Layer that implements the virtual accelerators, and a Software Controller that runs as a user-level process and controls and schedules all accesses to the accelerator. Applications communicate with the virtual accelerators through the Software-facing API, while the Software Controller communicates with the accelerators through the Hardware-facing API, which is integrated with the Xilinx SDAccel development environment.
The main contributions of this paper are the following: 1) A Software-facing API, which exposes FPGA accelerators as task queues to applications. 2) A Hardware-facing API, which simplifies the porting of kernels for hardware developers.
II. RELATED WORK
The use of heterogeneous systems comes at a significant cost: the increase in programming complexity at different levels. To overcome issues related to programming the FPGA itself, developers can employ high-level languages, such as OpenCL or SystemC [3], [4]. However, little has been done to reduce the effort of incorporating FPGAs in applications and services. Recent research has examined various techniques to partition FPGAs so they can be used by multiple applications. Our goal is to allow applications to simultaneously share each FPGA partition. In [5], a novel framework is presented that integrates reconfigurable accelerators in a standard server with virtualized resource management and communication. Unlike that approach, VineTalk places the accelerator controller in the host, completely avoiding dependencies on a hypervisor. In [6], a runtime system is proposed to simplify the FPGA application development process by providing a high-level API and a simple execution model that supports both software and hardware execution. VineTalk uses a host core for scheduling tasks and accelerators, which leaves all FPGA accelerator resources available to applications.
III. THE VineTalk FRAMEWORK
Our design consists of a Software-facing API (Section III-A), a Hardware-facing API (Section III-D), a Communication Layer (Section III-B) based on shared memory, and a Software Controller (Section III-C). Figure 1 presents the design of VineTalk.
A. Software Facing API
Our Software-facing API replaces the multitude of platform-specific acceleration APIs, all of which provide functions that handle memory management and the transfer of data and tasks between applications and hardware accelerators. The implementation of the API is completely decoupled from accelerator details. VineTalk achieves this by using three main abstractions: VineAccelerators, VineTasks, and VineBuffers.
A VineAccelerator is a virtual accelerator that can execute kernels on one or more physical accelerators. When an application creates a VineAccelerator, it specifies the kernel that it needs to execute, which we assume is preloaded in a repository on the local VineTalk node. There is no limit on the number of VineAccelerators that an application can create. Although VineTalk can remove unused kernels from the FPGA and instantiate new kernels as requested by applications, we do not explore this further. After a VineAccelerator has been allocated, an application can issue VineTasks. A VineTask represents a kernel invocation with its input and output data. VineTasks are not statically mapped to a physical accelerator. Consequently, a single physical accelerator can achieve higher utilization by executing VineTasks from multiple VineAccelerators. VineBuffers transparently handle the transfer of data between an application's address space and the physical accelerator. The Software-facing API enables applications to access these abstractions through the set of methods listed in Table I.
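The relationship between the three abstractions can be pictured with a minimal Python sketch. This is not VineTalk's API (the actual methods are those of Table I); every class, method, and kernel name below is hypothetical, and the "physical accelerator" is an ordinary Python function. The sketch only illustrates that tasks are queued on a virtual accelerator and bound to a device when they are served.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class VineBuffer:
    """Hypothetical stand-in: data handed to, or returned by, a task."""
    data: List[float]


@dataclass
class VineTask:
    """A kernel invocation together with its input and output buffers."""
    inputs: List[VineBuffer]
    outputs: List[VineBuffer] = field(default_factory=list)


class VineAccelerator:
    """Virtual accelerator: a kernel name plus a queue of pending tasks."""

    def __init__(self, kernel_name: str) -> None:
        self.kernel_name = kernel_name
        self.queue: Deque[VineTask] = deque()

    def issue(self, task: VineTask) -> VineTask:
        # The task is only queued; it is not bound to a physical device here.
        self.queue.append(task)
        return task


def drain(vaccels: List[VineAccelerator],
          kernels: Dict[str, Callable[[float], float]]) -> int:
    """Toy 'physical accelerator' that serves tasks from several virtual ones."""
    served = 0
    for vaccel in vaccels:
        kernel = kernels[vaccel.kernel_name]
        while vaccel.queue:
            task = vaccel.queue.popleft()
            task.outputs = [VineBuffer([kernel(x) for x in buf.data])
                            for buf in task.inputs]
            served += 1
    return served


# Two applications, each with its own virtual accelerator for the same kernel.
kernels = {"double": lambda x: 2.0 * x}  # assumed to be preloaded
app_a = VineAccelerator("double")
app_b = VineAccelerator("double")
task_a = app_a.issue(VineTask([VineBuffer([1.0, 2.0])]))
task_b = app_b.issue(VineTask([VineBuffer([3.0])]))
served = drain([app_a, app_b], kernels)
```

Because both virtual accelerators feed the same `drain` call, one device serves tasks from two applications, which is the source of the utilization gain described above.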
B. Communication Layer
VineTalk's Communication Layer implements and manages VineAccelerators and VineBuffers. Applications run as processes (or VMs) separate from the Software Controller, on a single node. Therefore, VineBuffers and VineTasks need to be transported across address spaces. To achieve this, we use a shared-memory-based transport: VineTalk stores all VineTasks and VineBuffers in shared memory. The advantage of shared memory within a server is that, after the setup phase, there is no need to use system calls. This transport relies on shared segments that can be mapped across native processes, containers, and VMs. Our shared memory approach currently introduces two additional copies, into and out of the shared memory segment, when sending data to and receiving data from the accelerator memory; the cost of these copies is negligible according to our results.
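A shared-segment transport of this kind can be sketched with Python's standard `multiprocessing.shared_memory` module. This is an illustration under stated assumptions, not VineTalk's implementation: the function names are hypothetical, both sides run in one process for brevity, and only the input-side copy into the segment is shown (the output path would mirror it).

```python
import struct
from multiprocessing import shared_memory
from typing import List


def app_write(values: List[float]) -> shared_memory.SharedMemory:
    """Application side: copy 4-byte float inputs into a new shared segment."""
    payload = struct.pack(f"{len(values)}f", *values)
    seg = shared_memory.SharedMemory(create=True, size=len(payload))
    seg.buf[:len(payload)] = payload  # the extra copy into shared memory
    return seg


def controller_read(name: str, count: int) -> List[float]:
    """Controller side: attach once by name, then read through the mapping."""
    seg = shared_memory.SharedMemory(name=name)
    try:
        # The segment may be rounded up to a page, so slice to the payload size.
        values = struct.unpack(f"{count}f", bytes(seg.buf[:count * 4]))
    finally:
        seg.close()
    return list(values)


inputs = [100.0, 95.0, 0.5, 0.25]  # four 4-byte values: 16 bytes in total
segment = app_write(inputs)
try:
    received = controller_read(segment.name, len(inputs))
finally:
    segment.close()
    segment.unlink()  # the creator removes the segment when done
```

Creating and attaching the segment involve system calls, but subsequent reads and writes go through the mapped memory, which matches the "no system calls after setup" property described above.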
C. Software Controller
The Software Controller is a process that controls all accesses to the underlying hardware. It monitors VineAccelerators for issued VineTasks and uses VineBuffers to retrieve the inputs and store the outputs. Moreover, it enables accelerator sharing, since it maps multiple VineAccelerators to the same physical accelerator. The Software Controller assigns a UNIX thread (i.e. an accelerator thread) to each physical accelerator in the system. This accelerator thread first selects the VineAccelerator that it is going to serve, based on a scheduling policy (currently round-robin). Second, it pops the first VineTask from the selected VineAccelerator and executes it on its physical accelerator. Then, it copies the result to the shared memory segment and serves the next VineAccelerator.
D. Hardware Facing API
VineTalk allows hardware designers to incorporate new kernels by using mainly two functions: VT2Accel() and Accel2VT(). VineTalk currently provides a number of different implementations of this simple API to cover kernels for different accelerators, including FPGAs and GPUs. For FPGA devices, VineTalk implements this API in OpenCL and SDAccel, whereas for GPU devices it implements this API in OpenCL and CUDA.
Porting applications to VineTalk consists of two steps. The first is to create a VineTalk library for each kernel and, in each library, a function that contains the kernel invocation. The second is to modify the application so that all accelerator-related functions are replaced with the corresponding methods of the Software-facing API.
VT2Accel() prepares the input data for an accelerator kernel. A hardware designer is expected to provide this function for each new kernel. The function allocates input buffers in the accelerator memory and copies into them the contents of the VineBuffers held in the Communication Layer. The host controller then calls it prior to kernel execution.
Similarly, Accel2VT() is called once after the kernel execution finishes. Its goal is to send the output back to the VineBuffers and to release the reserved accelerator memory.
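The interplay between the accelerator thread of Section III-C and the two hooks can be sketched as follows. This is a simplified model, not VineTalk code: `vt2accel` and `accel2vt` are hypothetical Python stand-ins for VT2Accel() and Accel2VT(), "device memory" is a plain list, the kernel is a Python function, and the loop runs inline rather than in a dedicated thread.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Task = Tuple[List[float], List[float]]  # (shared inputs, shared outputs)


def vt2accel(shared_inputs: List[float]) -> List[float]:
    """Stand-in for VT2Accel(): allocate 'device' memory and copy inputs in."""
    return list(shared_inputs)


def accel2vt(device_outputs: List[float], shared_outputs: List[float]) -> None:
    """Stand-in for Accel2VT(): copy results back, then release device memory."""
    shared_outputs.extend(device_outputs)
    device_outputs.clear()


def accelerator_thread(queues: Dict[str, Deque[Task]],
                       kernel: Callable[[float], float]) -> List[str]:
    """Serve virtual accelerators round-robin, one task per visit."""
    order: List[str] = []
    pending = deque(queues)  # names of the virtual accelerators
    while pending:
        name = pending.popleft()
        queue = queues[name]
        if not queue:
            continue  # nothing left to serve for this virtual accelerator
        inputs, outputs = queue.popleft()
        device_in = vt2accel(inputs)                 # before kernel execution
        device_out = [kernel(x) for x in device_in]  # "run" the kernel
        accel2vt(device_out, outputs)                # after kernel execution
        order.append(name)
        if queue:
            pending.append(name)  # revisit after the others had a turn
    return order


out_a1: List[float] = []
out_a2: List[float] = []
out_b1: List[float] = []
queues: Dict[str, Deque[Task]] = {
    "app_a": deque([([1.0], out_a1), ([2.0], out_a2)]),
    "app_b": deque([([10.0], out_b1)]),
}
order = accelerator_thread(queues, kernel=lambda x: x * x)
```

With two tasks queued by `app_a` and one by `app_b`, the tasks are interleaved rather than served application by application, which is the behavior that lets several applications share one physical accelerator.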
IV. INTEGRATION WITH SDACCEL
Xilinx has recently released the SDAccel framework, which is a development environment for OpenCL applications that targets Xilinx FPGA-based accelerator cards. It provides an interface between software applications and FPGA devices. The application consists of a host program written in C/C++ and one or more accelerated kernels written in C, C++, or the OpenCL language that run on the underlying FPGA board.
VineTalk sits between the application software side and the hardware side of SDAccel. It simplifies both the development of applications that use FPGA accelerators and the incorporation of new FPGA kernels into applications. The SDAccel-specific implementation of the Hardware-facing API (see Section III-D) allows any SDAccel kernel to be used by applications that use VineTalk, with no hardware dependencies. To evaluate the coding effort benefits of VineTalk, we port three financial applications: Black&Scholes, Black-76, and Binomial (described in Section V). We use SDAccel to build hardware-accelerated versions of these three algorithms. We also wrote and evaluated a simple application that interfaces with those kernels, submits tasks and data, and reports the results. To port an SDAccel application, we create a VineTalk library for each kernel. For each library we use the Hardware-facing API to simplify the kernel invocation. Moreover, on the application side, we replace all SDAccel-specific functions with the corresponding methods from VineTalk's Software-facing API. The resulting application has 30% fewer lines of code and uses semantically much simpler routines.
V. PERFORMANCE EVALUATION
A. Experimental Setup
For our experiments, we use one Intel(R) Core(TM) i5-4590 machine running at 3.3 GHz, with 16 GBytes of DRAM, and one Alpha Data ADM-PCIE-KU3 FPGA board, with 16 GB DDR3, connected via PCI Express Gen3 x8. The system runs CentOS 7 with SDAccel version 2016.4. In our evaluation we use three financial kernels: Black&Scholes, Black-76, and Binomial. Black&Scholes gives a theoretical estimate of the price of European-style options and can also be used for American-style call options. Black-76 is a variant of the Black&Scholes model. Binomial option pricing quantizes the time and price of an underlying asset, and maps both to a binary tree. We perform each experiment with 2000 options and with batch sizes varying between 1 and 512. The batch size represents the number of consecutive options that are transferred from the application space to the FPGA's memory in a single transfer. We exclude from our results the FPGA reconfiguration overhead, which amounts to 6.15 sec.
Black&Scholes and Black-76 have four inputs and one output per option; with a batch size of 1, the input size of a batch is 16 bytes (4 x 4 bytes) and the output size is 4 bytes. In contrast, Binomial uses five inputs and one output; with a batch size of 1, the input size of each batch is 20 bytes (5 x 4 bytes) and the output size is 4 bytes.
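The per-batch sizes above follow from simple arithmetic, sketched here in Python. Two assumptions are ours and not stated in the text: the sizes scale linearly with the batch size, and the final batch of a run may be partial. The function names are illustrative.

```python
import math
from typing import Tuple

VALUE_BYTES = 4  # each input or output value occupies 4 bytes


def batch_bytes(num_inputs: int, num_outputs: int,
                batch_size: int) -> Tuple[int, int]:
    """Return (input bytes, output bytes) moved per batch."""
    return (num_inputs * VALUE_BYTES * batch_size,
            num_outputs * VALUE_BYTES * batch_size)


def num_transfers(total_options: int, batch_size: int) -> int:
    """Number of batches needed, assuming the last one may be partial."""
    return math.ceil(total_options / batch_size)


bs_small = batch_bytes(4, 1, 1)          # Black&Scholes / Black-76, batch size 1
binomial_small = batch_bytes(5, 1, 1)    # Binomial, batch size 1
bs_large = batch_bytes(4, 1, 512)        # Black&Scholes / Black-76, batch size 512
transfers_small = num_transfers(2000, 1)
transfers_large = num_transfers(2000, 512)
```

Under these assumptions, a run of 2000 options needs 2000 transfers at batch size 1 but only 4 at batch size 512, which is why batching changes the execution time so strongly.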
B. VineTalk overhead
We compare the execution of the aforementioned applications with VineTalk against the standalone SDAccel execution (Native) to identify VineTalk's overhead. Figure 2 presents the normalized job execution time in milliseconds. Job execution times are averages over 20 runs, after removing the minimum and maximum values. As shown in Figure 2, VineTalk adds negligible overhead for small batch sizes and a slight penalty for larger ones. For Black&Scholes and Black-76, the overhead added by VineTalk is between 0.5% and 4% when the batch size is 512, while for batch sizes between 1 and 32 the overhead is negligible. Binomial has an overhead between 0.45% and 0.9% for all batch sizes. In all cases, the main source of overhead is the two additional data copies (inputs and outputs) that VineTalk requires in the shared memory segment. Although the overhead of those transfers is constant for most experiments, as they transfer similar amounts of data in aggregate, its impact on the execution of each experiment varies. Runs with larger batch sizes take significantly (by up to two orders of magnitude) less time to execute, and thus they become more sensitive to the overhead of memory transfers. Table II demonstrates this difference: it summarizes the overall application execution time for two batch sizes, 1 and 512, for Native and VineTalk. As the batch size increases, the application performance increases significantly for both systems. VineTalk incurs the lowest overhead with the Binomial application because it has longer task execution times (and thus a lower communication-to-computation ratio) than the other kernels. Black&Scholes and Black-76 are less compute intensive than the Binomial kernel, which results in a greater transfer-to-compute ratio for them than for Binomial. Consequently, the extra copies have a stronger effect on their execution time with VineTalk.
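The averaging procedure and the overhead metric can be written down as a short Python sketch. The function names are ours, and the timing values are invented purely for illustration; they are not measurements from this evaluation.

```python
from typing import List


def trimmed_mean(samples: List[float]) -> float:
    """Average after dropping one minimum and one maximum sample."""
    if len(samples) < 3:
        raise ValueError("need at least three samples to trim both extremes")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


def overhead_percent(vinetalk_ms: float, native_ms: float) -> float:
    """Relative slowdown of VineTalk with respect to Native, in percent."""
    return 100.0 * (vinetalk_ms - native_ms) / native_ms


# Illustrative timings only: 20 runs each, including one low and one high outlier.
native_runs = [20.0] * 18 + [12.0, 60.0]
vinetalk_runs = [20.5] * 18 + [15.0, 45.0]

native_avg = trimmed_mean(native_runs)
vinetalk_avg = trimmed_mean(vinetalk_runs)
overhead = overhead_percent(vinetalk_avg, native_avg)
```

Because the numerator is a roughly fixed copy cost while the denominator shrinks as batches grow, the same absolute cost shows up as a larger percentage for runs with large batch sizes.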
To evaluate the impact of accelerator sharing, we concurrently run up to two instances (jobs) of each application. Limitations of the current testbed, and more specifically the number of cores, do not allow us to run more concurrent instances. Concurrent job execution is possible only with VineTalk, which can interleave tasks belonging to different applications. For Native, we execute the two applications sequentially, one after the other. In each experiment, all application instances invoke the same kernel. Each run (i.e. one application or the sum of two) processes 2000 options in total with a batch size of 50. In the first run, the single job consists of 2000 options, whereas in the second run each of the two concurrent jobs consists of 1000 options. Table III compares the total serialized execution time (i.e. Native) with the total execution time under VineTalk. We also present the ratio of the Native to the VineTalk total execution time.
For all applications the execution time ratio is very close to 1; consequently, the overhead added by VineTalk is negligible. For Black&Scholes and Black-76, the VineTalk to Native job execution time ratio for 2 concurrent jobs is 0.97, while for the Binomial application it is 1.02. Thus, multiplexing applications with VineTalk does not introduce significant overhead.
VI. CONCLUSIONS
In this paper we demonstrate how FPGAs can be used transparently in datacenter servers. In particular, we present how FPGAs can be shared by multiple applications. We design and implement VineTalk, a system that provides a hardware-agnostic abstraction and is designed to be used with different accelerators, including FPGAs and GPUs. VineTalk uses an RPC-like API and a communication channel based on shared memory to allow low-overhead, shared access from applications to accelerators. Our approach is orthogonal to FPGA partitioning and can allow multiple applications to share each partition in an FPGA. Our results show that VineTalk reduces programmer effort at the application level, cutting the lines of code related to kernel invocation by about 30% with significantly simpler semantics, and introduces overhead between 0.9% and 4% compared to native application execution on the FPGA. Finally, VineTalk enables accelerator sharing among consolidated applications with less than 2% overhead.
VII. ACKNOWLEDGMENTS
This work has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 687628. Moreover, we thank the Xilinx University Program (XUP) for the kind donation of the software and hardware platforms.
