ABSTRACT
Domain disparity between the CPU and Hardware Accelerators (HAs) leads to CPU under-utilization and inter-domain data copy overheads. By exposing HA memory to the OS and the host MMU, these overheads can be eliminated. In this paper, we present a real-system shared virtual memory design for PCIe-based HAs that enables parallel heterogeneous execution on the CPU and HAs without driver overheads. We extend Linux with a custom memory manager and scheduler to manage HA memory and application cores, respectively. Our FPGA-based multi-application logic design supports simultaneous execution of multiple heterogeneous applications. We show the advantages of heterogeneous execution and analyze how our design reduces OS overhead.
INTRODUCTION
Over the past decade, hardware accelerators (HAs) have become very popular and have made their way into embedded (TI-OMAP [1]) and integrated designs (Intel SandyBridge [2], AMD Fusion [3]). Recent studies [4, 5] have shown that a heterogeneous computing platform comprising multi-cores and accelerators offers higher throughput than a CPU-only or an HA-only platform. However, heterogeneous platforms pose significant challenges to software developers. Achieving optimal application performance on these heterogeneous platforms requires a hybrid execution environment that can harness the computation capabilities of multiprocessors and accelerators simultaneously. This would be easier to achieve if operating systems could identify and manage HA cores and their local memory as compute-capable execution units and memory units, respectively.
Recent studies [6, 7] show that the traditional driver-oriented approach to managing HAs incurs a significant two-copy overhead when transferring data from the application buffer to accelerator memory via kernel buffers. Zero-copy mechanisms [8] and user-level drivers [9] eliminate most of the driver stack overhead, but the disparity between the CPU and accelerator domains still necessitates an inter-domain data copy (Fig. 1a) between host and accelerator memory. Therefore, most heterogeneous execution approaches adopt a task graph partitioning technique wherein a control thread executing on multicore CPUs dispatches a data-parallel task entirely to accelerators, thus wasting CPU resources. Some work on task partitioning between CPU and accelerator has been done using simulation [10] and custom hardware emulation [11], but these works neither perform data partitioning nor provide a PCIe-based real system implementation. The Twinpeaks software platform [4] provides better utilization of CPU resources through heterogeneous execution. However, true heterogeneity and maximal utilization of CPU and accelerator resources can be achieved only in a shared virtual memory (SVM) prototype like Exochi [12], in which both the CPU and the HA can access the entire physical address space and work on the virtual address space of applications.
In our work, we show how any PCIe-supporting system can be extended to an asymmetric shared virtual memory (ASVM) system (Fig. 1b) and support heterogeneous application execution without incurring data-copy or driver overheads. Our design extends operating system support to manage the accelerator's on-board memory just like host memory. We also design a kernel-level scheduler that manages execution queues for multiple application cores on our custom FPGA-based hardware design. Since PCIe communication is a bottleneck, SVM is considered harder to achieve on PCIe-based systems. In this paper, however, we present a detailed analysis of the PCIe communication overheads incurred in various data-parallel application execution scenarios. We demonstrate the feasibility and benefits of an SVM model on a hyperthreading-enabled quad-core Xeon-based PCIe-supported system connected to an ML605 accelerator board, using numerous applications and case studies.
OUR DESIGN

Hardware Design
Figure 2 depicts the hardware logic design hosting multiple application cores. A burst-capable shared AXI bus serves high-throughput memory accesses for application cores, direct PCIe transactions and DMA-driven PCIe transactions. Burst-master-capable application cores generate memory transactions on the AXI bus on behalf of the application logic engine. The slave registers of every application core are PCIe-mapped to OS kernel space and are updated with tasks by a kernel-level scheduler thread (section 2.2). Besides the actual application logic engine, each application core runs an execution engine, as shown in Fig. 2. The execution engine consists of FIFO buffers and control logic. The control logic manages the execution flow within the application logic. It probes for pending tasks in the slave interface registers and issues burst read/write requests to the master interface. Each input and output has a corresponding FIFO buffer, and the control logic channels the input and output data to the corresponding buffers. The control logic is also responsible for triggering the application logic engine when input data is available and for stalling it to avoid buffer underflow. The logic engine begins when the execution engine detects a ready-to-execute task in the slave registers. The input and output data addresses obtained from the task descriptor are used for burst memory accesses. Our host-driven ASVM software design handles the allocation and initialization of input and output a priori, thus obviating the overhead of a memory management module running within the accelerator. After size bytes of data, as specified in the descriptor, are computed and the output FIFO is flushed, the execution engine updates a completion flag in a status register and halts the logic engine. It also signals the PCIe interface logic, which in turn triggers an MSI interrupt over the PCIe bus indicating task completion.
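The control flow of the execution engine can be illustrated with a small software model. The sketch below is not the hardware design itself: the descriptor fields, the 64-byte burst length, the FIFO depth and the function names are all assumptions made for illustration, and the logic engine is modeled as a function that returns as many bytes as it receives.

```python
from collections import deque
from dataclasses import dataclass

BURST_BYTES = 64  # assumed burst length; the paper does not state one


@dataclass
class TaskDescriptor:
    """Hypothetical layout of a task written into the slave registers."""
    in_addr: int
    out_addr: int
    size: int
    ready: bool = True
    done: bool = False


def run_execution_engine(task, memory, logic, fifo_depth=4):
    """Model one task: burst-read input, run the logic engine, burst-write output."""
    if not task.ready:
        return False  # no pending task in the slave registers
    in_fifo, out_fifo = deque(), deque()
    read = written = 0
    while written < task.size:
        # Fill the input FIFO with burst reads while room and data remain.
        while read < task.size and len(in_fifo) < fifo_depth:
            n = min(BURST_BYTES, task.size - read)
            start = task.in_addr + read
            in_fifo.append(bytes(memory[start:start + n]))
            read += n
        # The logic engine runs only when input is available; otherwise it stalls.
        if in_fifo:
            out_fifo.append(logic(in_fifo.popleft()))
        # Drain the output FIFO to memory with burst writes.
        while out_fifo:
            chunk = out_fifo.popleft()
            start = task.out_addr + written
            memory[start:start + len(chunk)] = chunk
            written += len(chunk)
    task.ready = False
    task.done = True  # completion flag; the hardware would also raise an MSI
    return True
```

In this model, `memory` is a `bytearray` standing in for accelerator memory, and the caller is expected to have placed the input there beforehand, mirroring the host-driven allocation and initialization described above.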
Software Design
Accelerator memory management
We developed a custom kernel module, the Accelerator Memory Manager (AMM), which virtualizes accelerator memory just as the Linux memory manager (LMM) virtualizes host memory. During bootup, AMM builds a set of synthetic page tables in host memory and populates them with accelerator physical addresses for the entire address range assigned to the accelerator by the BIOS. Each synthetic page table entry corresponds to a 4 KB-aligned physical page in accelerator memory.
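As a rough illustration of the bootup step, the sketch below enumerates the 4 KB frames of an accelerator address range and groups them into page tables. The table size of 512 entries (an x86-64 style table of 8-byte entries) and the function name are assumptions; the paper does not specify the page table format.

```python
PAGE_SIZE = 4096          # 4 KB pages, as stated in the text
ENTRIES_PER_TABLE = 512   # assumption: x86-64 style table of 8-byte entries


def build_synthetic_page_tables(base, size):
    """Return a list of tables; each table lists accelerator frame addresses."""
    if base % PAGE_SIZE or size % PAGE_SIZE:
        raise ValueError("range must be 4 KB aligned")
    # One entry per 4 KB-aligned physical page in the accelerator range.
    frames = [base + off for off in range(0, size, PAGE_SIZE)]
    # Group consecutive entries into fixed-size page tables.
    return [frames[i:i + ENTRIES_PER_TABLE]
            for i in range(0, len(frames), ENTRIES_PER_TABLE)]
```

Under these assumptions each table covers 2 MB of accelerator memory, so a hypothetical 8 MB range would need four synthetic page tables.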
AMM is also responsible for assigning synthetic page tables to requesting processes. Upon a memory allocation request (using the APIs in section 2.3), LMM allocates a fraction of the requested memory in host memory. The rest of the request is fulfilled by AMM from accelerator memory. AMM does this by hooking synthetic page tables to the process page directory (as shown in Fig. 5). AMM maintains a free list of available synthetic page table pages. Upon demand, AMM hooks one or more of these page tables to the application page directory. It then moves the assigned synthetic page table identifier to a separate used list and associates it with the process ID of the executing process. If AMM fails to allocate from accelerator memory, the memory is allocated from host memory instead.
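The free-list and used-list bookkeeping can be sketched as follows. This is a user-space model under stated assumptions, not the kernel module: the class and method names are hypothetical, each page table is assumed to cover 2 MB, and "hooking" a table into a page directory is reduced to recording its owner.

```python
class AcceleratorMemoryManager:
    """Toy model of AMM bookkeeping (names and sizes are illustrative)."""

    def __init__(self, num_tables, table_bytes=2 * 1024 * 1024):
        self.free = list(range(num_tables))  # free list of page table ids
        self.used = {}                       # used list: table id -> process id
        self.table_bytes = table_bytes       # assumed coverage of one table

    def allocate(self, pid, nbytes):
        """Return (table_ids, host_fallback_bytes) for an accelerator request."""
        needed = -(-nbytes // self.table_bytes)  # ceiling division
        if needed > len(self.free):
            # Not enough accelerator memory: the request falls back to the host.
            return [], nbytes
        ids = [self.free.pop(0) for _ in range(needed)]
        for table_id in ids:
            self.used[table_id] = pid  # associate the table with the process
        return ids, 0

    def release(self, pid):
        """Return all tables owned by pid to the free list."""
        owned = sorted(t for t, owner in self.used.items() if owner == pid)
        for table_id in owned:
            del self.used[table_id]
            self.free.append(table_id)
        return owned
```

A real implementation would also have to handle partial allocations and concurrent requests, which this sketch leaves out.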
Accelerator task scheduling
Accelerator execution requests from various processes are managed by our Accelerator task scheduler (AccScheduler) kernel module. AccScheduler is invoked when an application task completion is detected by the interrupt handler. It is also invoked through our custom system call by heterogeneous application threads, indicating that a task is available for HA cores. Upon invocation, AccScheduler puts the invoking task in a queue and checks for an available accelerator thread in the schedule table. If one is found, it dispatches the task over PCIe by populating the slave registers of the relevant application core with the task descriptor. Our scheduler currently manages the per-HA logic core task queue using a FIFO policy. However, time-quantum-based, priority-based or QoS-aware policies can also be implemented on these queues.
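A minimal sketch of the per-core FIFO policy is given below. It models the two invocation paths named above (the system call and the completion interrupt) with hypothetical method names, and it replaces the PCIe write to the slave registers with an entry in a dispatch log.

```python
from collections import deque


class AccSchedulerModel:
    """Toy model of a per-HA-core FIFO task scheduler (names are illustrative)."""

    def __init__(self, core_ids):
        self.queues = {c: deque() for c in core_ids}  # one FIFO queue per core
        self.busy = {c: None for c in core_ids}       # schedule table
        self.dispatched = []                          # stands in for register writes

    def submit(self, core, task):
        """System call path: enqueue a task and dispatch if the core is idle."""
        self.queues[core].append(task)
        self._dispatch(core)

    def on_completion(self, core):
        """Interrupt path: mark the core idle and dispatch the next task."""
        finished = self.busy[core]
        self.busy[core] = None
        self._dispatch(core)
        return finished

    def _dispatch(self, core):
        if self.busy[core] is None and self.queues[core]:
            task = self.queues[core].popleft()  # FIFO order
            self.busy[core] = task
            self.dispatched.append((core, task))
```

Swapping the `deque` for a priority queue would be one way to experiment with the alternative policies mentioned in the text.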
Heterogeneous Application execution
To enable heterogeneous parallel execution on our SVM platform, we instrument applications with custom APIs. The execution flow of the resulting heterogeneous application is shown in Fig. 3. We heterogenize existing parallel application code by instrumenting it with just three custom API calls. First, applications allocate heterogeneous data arrays using our custom acc malloc system call. This allocates the virtual address space for the data by invoking LMM and AMM. Then, the task threads are spawned by the application and dispatched to CPUs, either to be run as CPU threads or to be scheduled to an HA core by AccScheduler. In our design, we dedicate one thread to handle HA execution by invoking AccScheduler through acc exec (another custom system call). A synchronization call, sync het, is needed only when the CPU performs post-computation on HA output. The data partitioning and the number of CPU threads forked determine the overall performance of the application.
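The three-call flow can be mimicked in user space with ordinary threads. In the sketch below, `acc_malloc`, the dedicated HA thread and the final join are stand-ins modeled on the acc malloc, acc exec and sync het calls; their signatures are assumptions, the "HA" share is simply processed by another Python thread, and the kernel `2*x + 1` is a placeholder computation.

```python
import threading


def acc_malloc(n, ha_fraction):
    """Stand-in allocator: one array plus the index where the HA share starts."""
    data = [0] * n
    split = n - round(n * ha_fraction)  # elements [0, split) stay with the CPU
    return data, split


def kernel(x):
    """Placeholder data-parallel computation."""
    return 2 * x + 1


def run_heterogeneous(values, ha_fraction, cpu_threads):
    data, split = acc_malloc(len(values), ha_fraction)
    data[:] = values
    out = [None] * len(values)

    def work(lo, hi):
        for i in range(lo, hi):
            out[i] = kernel(data[i])

    threads = []
    # CPU share: divide [0, split) into near-equal chunks, one per CPU thread.
    step = -(-split // cpu_threads) if split else 0
    for t in range(cpu_threads):
        lo, hi = t * step, min(split, (t + 1) * step)
        if lo < hi:
            threads.append(threading.Thread(target=work, args=(lo, hi)))
    # Dedicated thread standing in for the one that would call acc exec.
    threads.append(threading.Thread(target=work, args=(split, len(values))))
    for th in threads:
        th.start()
    # Stand-in for sync het: wait for CPU and HA shares before post-processing.
    for th in threads:
        th.join()
    return out
```

Calling `run_heterogeneous(list(range(10)), 0.9, 4)` assigns nine elements to the HA stand-in and one to the CPU threads, and returns the same result as applying `kernel` to every element in order.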
RESULTS AND ANALYSIS
We implemented our design on a quad-core Xeon platform connected to a Xilinx ML605 board over PCIe 2.0. Our hardware design, implemented on a Virtex-6 FPGA, uses 4x PCIe communication channels and runs at a clock frequency of 100 MHz. We also instrumented the Linux 3.1.10 kernel with patches for AMM, AccScheduler and the FS module extension. We executed seven embedded applications using different approaches. In this section, we demonstrate how our ASVM-based design offers improvement over traditional techniques for heterogeneous execution.
Application performance gains
To analyze the advantage offered by our design, we study the FIR compute kernel. In the test code, data read from a file is used directly for computation. Our heterogenized FIR application allocates a total of 128 MB of input data split between host and HA memory. The split ratio is passed as a parameter to the acc malloc call. Due to the SVM design, data can be initialized directly from the file at the destination address (host or HA memory). We extend the file-system (FS) module to enable direct transfer of data from FS kernel buffers to the destination address. When the FS module finds the current process ID and destination virtual address in the AMM used list, it initiates a DMA transfer for the HA-bound buffer data. Thus, whenever HA-bound buffer data is ready, our on-board DMA engine is configured by the FS module for an asynchronous transfer. The destination physical address is obtained through page table translation. Write-back data is written directly from host and HA memory (using the DMA engine) to FS buffers and then to disk. Figure 4a shows the performance of the heterogeneous FIR application for varying CPU-to-HA data split ratios with 4 CPU threads executing in parallel. A heterogeneous execution with a 9:1 HA-to-CPU task ratio provides better performance than HA-only execution for FIR (with 4 CPU threads). However, this ratio varies across applications, depending on the speedup each can achieve on accelerators. Also, the ratio decreases for massive multicore systems, as more of the accelerator's tasks can be offloaded to idle CPUs.
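The routing decision made by the extended FS module can be sketched as a lookup followed by an address translation. The data structures and the function name below are assumptions chosen for illustration: the AMM used list is modeled as a map from process ID to HA-backed virtual ranges, and the page table as a map from virtual page number to physical frame address. The sketch also assumes the buffer does not cross a page boundary.

```python
PAGE_SIZE = 4096


def plan_file_read(pid, dest_vaddr, nbytes, amm_used, page_table):
    """Decide how FS buffer data should reach its destination.

    amm_used:   dict pid -> list of (start_vaddr, end_vaddr) HA-backed ranges
    page_table: dict virtual page number -> physical frame address
    Returns ("dma", physical_address) for HA-bound data,
    or ("cpu_copy", virtual_address) for host-bound data.
    """
    for start, end in amm_used.get(pid, []):
        if start <= dest_vaddr and dest_vaddr + nbytes <= end:
            # HA-bound: translate the virtual address for the DMA engine.
            frame = page_table[dest_vaddr // PAGE_SIZE]
            return ("dma", frame + dest_vaddr % PAGE_SIZE)
    # Host-bound: the regular copy into the process address space applies.
    return ("cpu_copy", dest_vaddr)
```

A fuller model would split a multi-page buffer into per-page DMA descriptors, since consecutive virtual pages need not map to consecutive accelerator frames.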
The DMA reads over PCIe are asynchronous and proceed in parallel with the FS buffer reads from disk. We observe (in Fig. 4b) that direct data initialization in HA memory does not incur the significant overheads of the inter-domain data copy found in various driver-based approaches. Our design offers better performance than Serial DMA (one page transfer at a time) and Scatter-Gather DMA (SGDMA) driven zero-copy approaches. Our FS extension reduces the application turnaround time (the sum of file I/O and actual execution time). Like file read, file write also avoids the redundant copy and adds to the improvement in application turnaround time. Since outputs may or may not be written back to a file, we report only the file-read results in this paper. Figures 5 and 6 show the heterogeneous execution performance of 7 embedded applications for 128 MB of input data. We heterogenize these applications for simultaneous CPU and HA execution using our APIs. Due to their data-parallel nature, they scale well for varying data sizes, and there is minimal inter-thread communication over PCIe. Heterogeneous execution involving both the FPGA and the CPU shows an improvement of up to 25% over CPU-only or FPGA-only execution. We executed parallelized code on the CPUs by equally distributing the data allocated for CPU execution among the CPU threads. Our design incurs negligible OS overhead for scheduling and dispatching tasks to the FPGA. The lightweight interrupt handler wakes up the sleeping CPU thread, which then proceeds to join. The overhead of thread wake-up and join is negligible compared to the execution time for the huge datasets.
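Why a mixed split can beat HA-only execution can be seen in an idealized timing model, sketched below. The model is an assumption and is not taken from the paper: it ignores PCIe, scheduling and I/O costs, treats throughput as constant, and takes the finish time as the slower of the HA share and the largest CPU thread share. The rates used in any call are hypothetical inputs, not measured values.

```python
def cpu_chunks(cpu_units, threads):
    """Distribute the CPU share as equally as possible among the threads."""
    base, extra = divmod(cpu_units, threads)
    return [base + (1 if i < extra else 0) for i in range(threads)]


def het_time(total_units, ha_fraction, ha_rate, cpu_thread_rate, threads):
    """Idealized completion time of a CPU plus HA run (arbitrary units).

    ha_rate and cpu_thread_rate are hypothetical throughputs in units per
    second; all overheads are ignored.
    """
    ha_units = round(total_units * ha_fraction)
    cpu_units = total_units - ha_units
    ha_t = ha_units / ha_rate
    # CPU threads run in parallel, so the largest chunk determines CPU time.
    cpu_t = max(cpu_chunks(cpu_units, threads)) / cpu_thread_rate
    return max(ha_t, cpu_t)
```

In this model the best split is the one at which both sides finish together, so a faster accelerator or fewer CPU threads pushes the optimum toward a larger HA share, which is consistent with the qualitative trend described for the split ratio.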
Heterogeneous Execution
CONCLUSION
Hardware accelerators have huge compute potential, but the traditional driver-oriented approach limits their performance due to data transfer overheads. In this paper, we proposed an asymmetric shared virtual memory design that enables applications to directly allocate and initialize data in accelerator local memory by extending user space to accelerator memory. Our design employs a memory manager and an execution manager to manage heterogeneous execution at the kernel level. In our experiments with heterogeneous implementations of embedded applications, we observed up to 15% improvement in total application execution time by avoiding data copies and the OS overheads of redundant data initialization. We also achieved up to 25% improvement in the execution time of data-parallel functions over CPU-only or FPGA-only execution.
