Abstract-While high-end heterogeneous systems are increasingly supporting heterogeneous uniform memory access (hUMA), their low-power counterparts still lack basic features like virtual memory support for accelerators. Instead of simply passing pointers, explicit data management involving copies is needed which hampers programmability and performance. In this work, we evaluate a mixed hardware/software solution for lightweight virtual memory support for many-core accelerators in heterogeneous embedded systemson-chip. Based on an input/output translation lookaside buffer managed by a host kernel-level driver, and compiler extensions protecting the accelerator's accesses to shared data, our solution is non-intrusive to the architecture of the accelerator cores, and enables zero-copy sharing of pointer-rich data structures.
Ç

INTRODUCTION
M ODERN embedded systems-on-chip (SoCs) are heavily based on heterogeneous architectures coupling a feature-rich, general-purpose, multi-core CPU (the host) with massively parallel, programmable many-core accelerators (PMCAs) such as Epiphany [1] , MPPA Manycore [2] , STHORM [3] , Rigel [4] , and PULP [5] . These architectures are nominally capable of extremely high GOPS/Watt targets, provided that the software can make an effective use of the available hardware resources. The burden of achieving this cumbersome task is nowadays entirely left in the hands of the application programmers. What poses probably the most difficult obstacles in traditional accelerator programming is the complex memory systems adopted by heterogeneous designs resulting in partitioned memory models between host and PMCA. A regular host application sees memory as a flat resource. Memory management units (MMUs) and coherent caches are used to transparently manage the underlying virtual memory system. PMCAs typically feature local, private, physically addressed scratchpad memories (SPMs). Writing a program for heterogeneous systems thus implies orchestrating offload sequences from the host to the PMCA that require explicit data management. This includes programming direct memory access (DMA) engines to copy data to/from the PMCA and manually maintaining data consistency with explicit coherency operations such as cache flushes. While certain classes of applications exhibit regular memory access patterns, for which it is relatively simple to identify workload partitioning strategies that are amenable to DMA transfers, other types of applications adopt completely irregular memory patterns based on the traversal of complex data structures (trees, lists, etc.) which imply data-dependent access to memory and are often impossible to predict statically.
The heavily control-oriented nature of graph-based programs (e.g., pointer-rich data structures, pointer chasing, etc.) poorly matches with the the single instruction, multiple data (SIMD) execution model of general-purpose graphics processing units (GPGPUs) and their performance susceptibility to divergent control flow. Executing such programs on a GPGPU requires to completely re-think and rewrite the original algorithms and data structures, often resorting to redundant computations. While state-of-the-art automatic parallelization approaches [6] can simplify this cumbersome task and achieve comparable performance to manually tuned implementations, the latter only provide modest speedups for this type of applications.
The class of PMCAs targeted by this work is well suited also for applications with control-dominated computation implied by pointer-chasing execution patterns. The main difficulties with offloading pointer-intensive computations to such PMCAs reside on sharing irregular data structures with the host. Without virtual memory support for the PMCA, the host needs to copy the entire data structures into a physically contiguous, un-paged, un-cached memory section at offload time. On top of that, any virtual address stored inside the data structures (typically initialized on the host side) needs to be adjusted to point to the copy in physically contiguous memory. Practically, this requires traversing the entire data structures at runtime, which not only hampers programmability, but also kills performance. Indeed, the overheads associated to such operations are considerable. For example, for PageRank [7] , a representative example of a pointer-chasing application, we measured overheads in execution time and code size of 78 and 90 percent, respectively. Sharing data between host and PMCA in this scenario cannot be efficiently achieved through data copies.
Initiatives such as the Heterogeneous System Architecture foundation (HSA) [8] are pushing for an architectural model where the host processor and the accelerator(s) communicate via coherent, shared virtual memory. The HSA memory architecture moves management of host and accelerator memory coherency from the developer's hands down to the hardware. This enables direct and transparent access to system memory from both sides, eliminating the need for explicit management of different memories. In this scenario, an offload sequence simply consists of passing virtual memory pointers to shared data from the host to the accelerator, in the same way that shared memory parallel programs pass pointers between threads running on a CPU.
Slowly but surely, heterogeneous shared virtual memory (SVM) is becoming a reality. In fact, the major vendors of high-end heterogeneous SoCs have products on the market that support SVM according to the OpenCL 2.0 specification [9] , [10] , [11] , [12] . To this end, these SoCs feature input/output memory management units, i.e., dedicated hardware blocks to do the virtual-to-physical address translation for the GPU's requests to shared memory. As opposed to the memory management units (MMUs) found inside CPU cores, these input/output MMUs (IOMMUs) are shared among many GPU processing elements (PEs) and DMA engines. The need to support concurrent accesses to shared memory without blocking in case of a miss in the translation cache substantially increases design complexity. Similar to the cache hierarchies in multi-core CPUs, multiple heterogeneous translation caches are combined to form complex hierarchies allowing parallel access and to lower overall miss rates and hardware cost [13] . Page-table walker (PTW) engines are massively multithreaded to do multiple virtualto-physical address translations in parallel [13] , [14] . A large number of miss status holding registers and buffer memories are required to finish outstanding shared memory transactions once the corresponding miss has been handled and to absorb missing DMA transfers, respectively [15] , [16] . To make these complex hardware architectures to scale up to hundreds or thousands of PEs is a major design challenge. Hardware-managed coherency further adds to the design complexity [16] . While heterogeneous, coherent SVM can be justified in the context of high-end systems, it is probably not affordable for PMCAs targeting low-power embedded SoCs-just like data caches and associated coherency protocols, which are typically replaced by software-managed SPMs for increased scalability and maximum energy efficiency.
In this work, we propose an integrated hardware/software solution to allow virtual main memory sharing between the host and the PMCA in a low-power SoC. On the hardware side, we propose a simple Remapping Address Block (RAB). The RAB is a software-managed input/output translation lookaside buffer (IOTLB), used by the accelerator to translate virtual addresses (as seen by the application) to their physical counterpart in main memory. As opposed to a full-fledged IOMMU found in high-end, heterogeneous SoCs supporting SVM, the setup of the RAB is entirely done in software via a kernel-level driver module, which is controlled by the user-space runtime linked to the runtime system of a programming model such as OpenMP or OpenCL. Memory coherency is software managed similar to basic, coarse-grained SVM specified by OpenCL 2.0 [17] . In order to support zero-copy offloading of kernels relying on pointer-rich data structures, we implement a compiler pass to automatically protect the accelerator's accesses to shared data elements. This is achieved with calls to low-overhead tryread() and trywrite() functions that validate the response of the shared memory accesses using a low-latency access, special-purpose register. In case of a RAB miss, the calling processing element (PE) of the accelerator is put into sleep, and a miss-handling routine in the kernel-level driver module running on the host is scheduled. We verified our solution with real-life applications relying on pointer-rich data structures, annotated with OpenMP directives for heterogeneous execution. The results demonstrate that for non-strictly memory-bound applications the cost of our solution is negligible if compared to an ideal IOMMU solution.
The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the architecture template targeted in this work. Our mixed hardware/software solution is presented in Section 4. The results are given in Section 5. Section 6 concludes the paper.
RELATED WORK
The support of shared virtual memory (SVM) for accelerators in embedded systems has not yet been widely addressed by the research community. The internals of the solutions for SVM employed in today's embedded and high-end SoCs are not known to the public in detail [9] , [10] , [13] . Nevertheless, there are related works in several fields.
Accelerator-Based Computing. Motivated by the lack of high-level abstractions in heterogeneous parallel programming models, which requires programmers to resort to complex data copying and synchronization schemes, the research community has come up with various proposals for easing programmability and improving performance. Examples include a runtime system and architecture support for simple and efficient data exchange [18] as well as an integrated message passing framework targeting end-toend data movement among CUDA, OpenCL and CPU memory spaces [19] . An overview of current heterogeneous systems and development frameworks [20] concludes that most works focus on outsourcing compute-intensive tasks entirely to accelerators, leaving the host CPU idle while the accelerators are busy. Letting both accelerators and CPUs collaboratively share work significantly increases the system's efficiency. This can be achieved by using, e.g., a heterogeneous workload partitioning scheme [21] .
A recent survey analyzing state-of-the-art research and implementation of high-performance computing algorithms [22] concludes that most algorithms best suited for CPUs typically feature low arithmetic computation and complex memory handling. Possible examples are map reduce, dynamic programming, combinational optimization problems, and graph processing. What prevents them from being implemented on GPGPUs and FPGAs are design reuse, CPU dependency, on-chip memory size, recursion, overall complexity, synchronization, area-expensive floating point units (FPUs), and highly irregular memory access patterns. Our work targets the tight integration of PMCAs with higher flexibility than GPGPUs and FPGAs into heterogeneous SoCs supporting SVM and fine-grained task offloading. Extending PMCAs with shared FPUs [23] can make such platforms attractive targets for many of these algorithms -albeit at a different problem scale.
Physically Contiguous Virtual Memory. The most simple scheme for zero-copy memory sharing is to allocate a large physically contiguous memory section to hold the shared data. Virtual-to-physical address translation then simplifies to applying a constant offset. There is no need to handle IOTLB misses or page faults. Aiming at avoiding virtual memory overheads in long-running, big-memory workloads, direct segment [24] establishes per-process, direct mappings between regions contiguous in both virtual and physical memory. It is however less suitable for dynamic execution environments where many processes execute for short periods. With the contiguous memory allocator (CMA), Linux features a mechanism to allocate large chunks of physically contiguous memory at boot time [25] . A kernel-level driver may then request memory from this pre-allocated section and give access to it to user-space applications through, e.g., an mmap() system call. Using huge memory pages for the shared data [26] , [27] , [28] can mitigate the performance penalty due to TLB misses. While such techniques allow for zero-copy data sharing, they still require explicit translation of shared virtual address pointers. Moreover, they all suffer from limited performance predictability, especially for embedded systems with limited amounts of physical memory and fragmented address spaces. It is even possible that a pre-allocated memory region cannot be made available at all [29] . Consequently, these techniques cannot prevent the need for true SVM support for accelerators in heterogeneous embedded SoCs when it comes to processing of pointer-rich data structures.
SVM for Accelerators. SVM Frameworks for GPGPUs, such as CUDA's unified virtual memory, let the compiler and the runtime handle the management of the shared main memory and lead to notable code simplifications and performance gains [30] . Due to the completely closed software stacks, the internals of such frameworks are not known to the public. Hardware IOMMU designs for high-bandwidth network and storage devices rely on kernel-level drivers to pin the shared pages in memory and to build up a dedicated I/O page table exclusively accessed by the IOMMU. Recent works [31] , [32] target at mitigating the two main bottlenecks of such systems, i.e., the scalability of I/O virtual address (de-)allocation and IOTLB invalidation. Such systems are optimized for streaming large chunks of data in high-throughput, multi-Gb/s scenarios and less suitable for fine-grained data sharing between a host processor and a PMCA: The cost of setting up and removing a single mapping [32] are comparable to what we achieve with purely software-based IOTLB management, despite dedicated hardware IOMMUs including page-table walker (PTW) engines and prefetching logic.
To fully exploit hardware IOMMUs in heterogeneous computing systems and turn their potential into a performance boost, the hardware/software interaction needs to be reduced to a minimum. Instead of relying on a driver to pin shared pages and build up a dedicated page table for the IOMMU, the IOMMU itself must coherently operate on the process' page table and participate in TLB invalidation and shootdowns, just like a regular CPU core of the host. Such schemes have been studied in the context of highperformance systems featuring discrete FPGA-based accelerators [33] or GPGPUs [14] , [15] . As opposed to discrete highperformance accelerators connected to high-end host systems through PCIe I/O links, the PMCAs addressed in this work have a tight area and power budget and are highly integrated into heterogeneous embedded SoCs. Instead of trying to optimally exploit a full-fledged hardware IOMMU for maximum performance, the focus lies on designing a mixed hardware/software solution for SVM that minimizes the hardware cost and still delivers sufficient performance.
The only work in the embedded systems domain which presents results about the hardware complexity of the FPGA implementation [34] targets full virtualization of multiple, parallel accelerator devices with fast context switching in high-performance SoCs using a full-fledged IOMMU. Its TLB offers acceleration to multiple concurrent caching services for the translation process, but the IOMMU itself only supports one outstanding IOTLB miss. If another IOTLB miss happens while the PTW is busy, the IOMMU blocks any accelerator-to-host communication which impedes adoption to highly scalable PMCAs. In contrast, our solution simply enqueues missing requests to the miss-handling routine running on the host. Software miss handling may be slower compared to a hardware PTW. But independent of the number of enqueued misses, the IOTLB in our solution continues to service requests to shared main memory from other, unblocked cores and DMA engines. Fig. 1 shows the block diagram of the heterogeneous embedded system template targeted in this work. It consists of a powerful general-purpose multi-core CPU (the host), featuring a multi-level cache-coherent memory hierarchy and capable of running full-fledged operating systems. To improve overall performance/Watt, the host is coupled to a programmable many-core accelerator (PMCA) composed of several tens of simple processor elements (PEs), where critical computation kernels can be offloaded [3] , [35] , [36] . The type of PMCA that we consider leverages a multicluster design to overcome scalability limitations [2] , [3] , [37] . The PEs within a cluster share an L1 instruction cache and an L1 data scratchpad memory (SPM). The SPMs of all clusters and a shared L2 SPM are mapped in a global, physical address space. The host and the PMCA physically share the main memory [8] , meaning that they both have a physical communication channel to the main dynamic random access memory (DRAM), as opposed to a more traditional accelerator model where the latter uses a private DRAM. For improved data and computation locality, both the host and the PMCA leverage the internal memory hierarchy to keep most frequently accessed data in fast, local storage. The host does so relying on hardware-managed caches. The PMCA leverages multi-channel, high-bandwidth direct memory access (DMA) engines and double-buffering schemes. While such schemes allow to efficiently overlap data movement with actual computation performed on data in the lowlatency L1 SPMs, they usually require heavy application refactoring. Alternatively, a software cache [38] can be used to automate data movement and to avoid the need for handoptimized DMA programming.
TARGET HETEROGENEOUS ARCHITECTURE TEMPLATE
To simplify programmability, an IOMMU may be placed in front of the accelerator allowing the host and the PMCA to exchange virtual shared data pointers [8] , [39] . While the hardware IOMMUs found in today's high-end heterogeneous SoCs allow for zero-copy memory sharing, they are not for free. The need to support concurrent accesses of the various PEs to shared memory without blocking in case of IOTLB misses leads to a high overall design complexity. For reasons of power consumption and circuit area, the practicable size of fully-associative IOTLBs is limited to typically less than 64 entries. To further reduce the miss rate at lower cost, complex multi-level IOTLB hierarchies combining IOTLBs of different size, latency and cost are used [13] . PTWs are massively multithreaded to serve multiple IOTLBs and misses in parallel [13] , [14] , and miss status holding registers allow for outstanding read and write requests to be finished once the corresponding misses have been handled [15] . Additional buffer memory is required to also absorb missing DMA requests. For example, in a PMCA using AXI4-based interconnects [40] with address and data widths of 32 and 64 bits, respectively, the IOMMU would already require 19 KiB of buffer memory for a single cluster featuring 16 PEs and 1 DMA engine with 4 channels (max. transfer size of 4 KiB). Designing an IOMMU capable of absorbing DMA requests from an increasing number of clusters thus has obvious scalability limitations.
While fully hardware-managed SVM seems to be the way to go for high-end systems, the associated costs might not be affordable for embedded systems targeting the low-power domain. Typically, such systems rather split the main memory into two sections. One being managed and accessed exclusively by the host and a second, physically contiguous and un-paged, un-cached section that is also accessible by the PMCA. The host is required to copy any shared data between the two memory sections, and to adjust any virtual address pointer in the shared data to point to the copy in the physically contiguous, un-cached memory section. This hampers programmability and kills performance. The shared virtual memory solution presented in this work seeks to enable the performance and programmability of a hardware-managed design at the cost of a the latter solution prevalent in today's embedded systems.
LIGHT SHARED VIRTUAL MEMORY SUPPORT
In this section, we describe the hardware/software infrastructure enabling lightweight shared virtual memory between host and PMCA in a heterogeneous embedded system matching the template described in Section 3.
Remapping Address Block
To enable the host and the PMCA to communicate through shared virtual memory, the PMCA's network interface (NI) is connected to the host's main memory interconnect through a multi-ported Remapping Address Block (RAB), similar to the IOMMU shown in Fig. 1 . It is basically a software-managed input/output translation lookaside buffer (IOTLB) which is used by the PMCA to translate virtual, host addresses of outgoing memory transactions to physical addresses. As opposed to a full-fledged IOMMU such as [34] , [39] , the RAB does not feature dedicated prefetching and page-table walker (PTW) hardware to guess the virtual addresses of upcoming transactions and to do the corresponding virtual-to-physical address translation ahead of time or in case of an IOTLB miss. Doing without prefetching and PTW hardware and associated caches often used to speed up these operations allows to substantially reduce design complexity of the RAB, which is key for constrained embedded systems. The setup of the IOTLB is completely done in software by the kernel-level driver module running on the host as described in Section 4.2. Fig. 2a shows a schematic of the RAB. It supports multiple ports, each of which comprises private AXI4 master and slave interfaces, and a private IOTLB with a parameterizable size implemented using a fully-associative contentaddressable memory (CAM). Currently, the RAB is optimized for low latency and high flexibility. As opposed to a conventional IOTLB, each entry of the RAB, which we define as a slice, can hold one arbitrary sized mapping, independent of the host's page size. This allows multiple pages to be remapped using a single RAB slice if they are contiguous in both virtual and physical memory. To this end, also the virtual end address needs to be stored besides the virtual and the physical start address of the mapping which increases the size of the CAM. Per slice, two protection flags define whether the slice is enabled/disabled and if read and/or write access is permitted. To set up the slices of the individual RAB ports, the host uses a shared AXI4-Lite configuration interface. If a new memory transaction arrives at an AXI4 slave interface, the requested virtual address as well as the transaction size are fed to the corresponding IOTLB to check if one of the slices is holding a valid configuration. If this is the case, the IOTLB returns the corresponding physical address which is then used to issue the translated memory transaction at the master interface. If no valid configuration is found or if the protection flags do not allow the requested transaction, it is simply dropped and an interrupt is sent to the host. A slave error is signaled back to the issuing master in the AXI Read/Write Response signal. The missing address and the transaction ID are written to two separate first in, first out buffers (FIFOs) which can be read by the miss-handling driver routine through the AXI4-Lite configuration interface. The CAM itself has a look-up latency of one cycle.
The tryx() Operation
The RAB is managed completely in software by a kernellevel driver module running on the host. The interaction of the different abstraction layers is visualized in Fig. 2b . When the heterogeneous application is started, it must first reserve those addresses in its own virtual address space that overlap with the physical address space of the PMCA 1 1 . This step is required to make sure that the host never passes a virtual address pointer to the PMCA that overlaps with its own address space, since the PMCA would not route an access to such a virtual address through the RAB but instead route it internally, e.g., to its internal memories or memory-mapped registers. After copying the PMCA executable to the internal L2 memory, the driver can start the PMCA and pass the virtual address of the shared data elements to the PMCA using a memory-mapped mailbox 2 .
On the PMCA side, accesses to data elements shared with the host (i.e., residing in main memory) need to be protected with calls to low-overhead tryread() and trywrite() functions. In the following, the term tryx() is used to refer to both these functions. First of all, they simply issue the load or store to the address of interest 3 . In case of a RAB miss, i.e., if no valid mapping for this address is set up in the RAB, the RAB i) stores the missing address and the transaction ID, ii) sends an interrupt to the host, and iii) returns 0 to the issuing PE in the read data signal in case of a load, and signals a slave error in the read/write response signal 4 . This response is then forwarded to a private, specialpurpose, low-latency access TRYX control register placed close to the PE's data interface. To validate the response returned by the RAB and to check whether a miss happened, the tryx() functions issue a read to this register 5 . In case of a miss, the PE goes to sleep and waits for the host to handle the miss 6 .
To handle the RAB miss on the host, the kernel-level driver uses the Concurrency Managed Workqueue API of Linux. Upon receiving a RAB miss interrupt, the interrupt handler simply enqueues the miss-handling routine to a dedicated workqueue 7 . The worker thread executing this routine is then scheduled in normal process context 8 . This is required because some kernel functions executed by the routine may sleep, and therefore cannot be executed in interrupt context. Once started, the routine reads the missing address and the transaction ID from the corresponding FIFOs in the RAB. After locking the requested user-space page in memory using get_user_pages(), the virtual-to-physical address translation is performed, and a new RAB slice for the missing mapping can be configured. In case all RAB slices are in use, the routine simply removes the oldest mapping and unlocks the corresponding user-space page before setting up the new remapping (FIFO replacement strategy). The remapped page is flushed from the host's data caches back to DRAM and the sleeping PE can be woken up. The miss-handling routine then continues to handle misses until the FIFOs inside the RAB are empty. After waking up, the PE can safely repeat the load or store to the address of interest 9 and then continue normal program execution.
In the case of a RAB hit, the memory request of the PE 3 is simply forwarded to shared memory 4 . The PE validates the response returned by the RAB by reading the TRYX control register 5 and continues program execution.
Compiler and Runtime Support
Modern programming models are evolving to simplify the development of applications for heterogeneous systems. The application developer can focus on partitioning the computation and specifying the parts to be offloaded to the PMCA-using, e.g., OpenMP offload directives [41] or the lower-level programming style of OpenCL [17] while the underlying compiler and runtime system take care of implementing the desired heterogeneous and parallel semantics. The proposed framework relies on a program transformation implemented in the compiler that transparently protects (via a call to the tryx() functions) the access to those data items that are accessed from the host (main) memory. The key idea that we leverage to build this compiler pass is that computation offloading in any programming model is specified through some sort of language construct to specify which data originated in the host execution environment are later communicated to the PMCA. As an example, we consider the following code snippet written using the OpenMP 4.0 specification [41] .
#pragma omp target mapðverticesÞ map ðto : n verticesÞ {/* offloaded code */} The target directive is used to syntactically highlight which parts of the host program are to be compiled 1. This can be achieved with an mmap() system call with the flags MAP_FIXED, MAP_ANONYMOUS and PROT_NONE to get exactly the reserved address segment, not back the mapping with any physical memory, and not contribute to the kernel's overcommit limit.
and executed on the accelerator. The map clause allows to list a set of host variables (scalars, arrays or parts of arrays) that the PMCA has access to in read-only (map (to: <list>)), write-only (map(from: <list>)) or read-write mode.
Starting from this information, the proposed compiler pass annotates all the uses of such variables within the target block (i.e., the code that will execute on the PMCA, and that thus needs protection of accesses going to shared memory). Fig. 3 shows the transformation process on a simplified excerpt of PageRank [7] . Two host variables are annotated for PMCA access: vertices (Variable 1) and n_vertices (Variable 2). Variable 2 is of type int, thus the compiler just emits a tryread() function call on its address. The analysis is more complex on Variable 1, which is of a pointer type and pointing to a struct containing several fields (many of which pointers), as it usually happens for pointer-chasing access patterns. Here, not only do we need to follow the use-definition chains of the map variables, but we also need to apply escape analysis to determine when dereferencing a pointer is interpreted as an address whose value is further propagated through the program. A statement where such an address is read into temporary storage is thus marked as an escape point, and the analysis is recursively applied.
Concretely, Variable 1 in Fig. 3 
().
DMA transfers are protected inside the DMA transfer setup routine of the PMCA runtime. If the cluster-external address is virtual, the routine inserts a tryx() call to every memory page touched by the DMA before starting the transfer. If the cluster-external address is physical, the transfer can be setup right away. PMCA-internal DMA transfers need no protection.
Comparison with Existing Solutions
The mixed hardware/software solution presented in this work implements all the functionality required for shared virtual memory and is fully functional. How it compares with hardware IOMMUs is shown in Table 1 .
The IOMMUs found in today's high-end desktop CPUs [42] , [43] and embedded systems [13] are fully hardware managed. In case of an IOTLB miss, a dedicated 
TABLE 1 Comparison of Shared Virtual Memory Solutions
hardware block walks the page table to do the virtual-tophysical address translation. Typically, some sort of least recently used (LRU) or Pseudo-LRU replacement algorithm is used to decide which IOTLB entry to replace. Such replacement algorithms allow to reduce the IOTLB miss rate, but they require hardware support to monitor the usage of the individual IOTLB entries. In contrast, the software routine handling the misses in our design has no knowledge on the usage of the IOTLB entries besides the setup time. The only viable option is thus a simple FIFO algorithm.
The advantage of our design is that, irrespective of the number of currently outstanding misses, the IOTLB can continue to serve translation requests without blocking the PMCA's traffic to shared memory. In hardware-managed IOMMUs, this is mitigated using complex hardware designs including multithreaded PTWs [13] , [14] , and hierarchical IOTLB architectures. For example, IOMMUs targeting high-throughput I/O devices [42] , [43] support both IOTLBs local to the IOMMU as well as remote IOTLBs inside the I/O device. ARM's system-level MMU-500 [13] uses per-port L1 and a shared L2 IOTLB. Per IOTLB, multiple outstanding misses are supported. In GPGPUs, IOTLB misses usually happen in big bursts. To allow other warps to access the shared memory while the misses of stalled warps are handled, every compute unit (CU) consisting of many PEs features a private IOTLB including multiple miss status holding registers. The PTW is either shared but massively multithreaded [14] or private per IOTLB [15] .
It is important to note that hardware IOMMUs not automatically guarantee best performance. For example, the PTWs in today's IOMMUs targeting high-throughput I/O devices [42] , [43] cannot operate on the process' page table. Instead, they require software drivers to build up a dedicated I/O page table which is a major bottleneck addressed by recent works [31] , [32] . The support for interrupt remapping and multiple device contexts, i.e., multiple host applications concurrently offloading kernels to the PMCA, is currently not implemented in our design.
RESULTS
We first present our evaluation platform, i.e., our embodiment of the target architecture template presented in Section 3. The cost of the proposed SVM solution are compared to alternative designs in Section 5.2. Section 5.3 gives the results for three real heterogeneous applications based on pointer-rich data structures that were run on the evaluation platform, compared to an ideal SVM implementation. In Section 5.4, we introduce a synthetic model to predict the performance of our solution and apply it to the MiBench [45] embedded benchmark suite.
Evaluation Platform
Our evaluation platform is based on the Xilinx Zynq-7045 All Programmable SoC [46] . Fig. 4 gives an overview. The programmable system of the Zynq SoC features interconnects, peripherals, a DRAM controller, and the application processor unit with a dual-core ARM Cortex-A9 CPU at its heart. Each core has separate L1 instruction and data caches with a size of 32 KiB each. Further, the application processor unit offers 512 KiB of shared L2 instruction and data cache including controller that connects to the high-priority port of the DRAM controller. The programmable system is used to implement the host of the embedded heterogeneous SoC. It is running Xilinx Linux 3.18.
The programmable logic of the Zynq SoC consists of a Xilinx Kintex-7 FPGA which is used to implement PULP: a PMCA developed as an ASIC for Parallel Ultra-Low Power Processing [5] . To overcome scalability limitations, PULP leverages a multi-cluster design. The processing elements (PEs) within a cluster feature 4 KiB of shared L1 instruction cache, and share 72 KiB multi-banked tightly-coupled data memory as L1 SPM. The memory banks are connected to the PEs through a low-latency crossbar interconnect with a wordlevel interleaving scheme to minimize access contention. Ideally, every PE can access one word in the L1 SPM per cycle. The L1 SPMs of all clusters as well as the 256 KiB L2 globally shared SPM are mapped in a global, physical address space, meaning that the PEs can also access data in remote L1 SPMs, albeit with a higher latency. Every cluster features a lightweight, low-programming-latency, multi-channel DMA engine which allows for fast and flexible movement of data between L1 and L2 memory or shared DRAM [47] . The event units inside the clusters' peripherals are used for both intraand inter-cluster synchronization, and to put PEs into sleep and to wake them up. PULP is attached to the host as a memory-mapped device and controlled by a kernel-level driver module and a user-space runtime. The host and the PMCA share 1 GiB of DDR3 DRAM. The RAB uses one port with 32 slices for PMCA-to-host communication and connects the PMCA to the high-performance AXI slave port of the Zynq's DRAM controller. In case of a miss, the RAB signals a slave error back to the cluster internal Per2AXI protocol converter using the AXI Read/Write Response signal. The protocol converter propagates the response signal back to the PEs' private, low-latency access, special-purpose TRYX control registers placed close to the demux, which connect to the PEs' data interfaces.
The main goal of our platform is to study and evaluate the system-level integration of a PMCA like PULP into an embedded heterogeneous SoC. We did not optimize the PMCA for the implementation on the FPGA. The FPGA implementation should more be seen as an emulator instead of a fully featured accelerator. The PMCA has a single cluster comprising four PEs. We adjusted the clock frequencies of the different components to obtain ratios similar to a real heterogeneous SoC with the host and the PMCA running at 2133 MHz and 500 MHz, respectively. The speed of the DDR3 DRAM is selected such as to model a shared last-level cache with a total access latency of 15 and 14 clock cycles for read and write accesses issued by the PMCA, respectively. Table 2 gives an overview of the hardware cost of our solution. In terms of look-up table (LUT) and register slices, the FPGA resource utilization of the RAB is about 8 and 12 percent that of a PULP cluster with four PEs, respectively. The TRYX control register accounts for 0.3 and 0.1 percent of a cluster's LUT and register slices, respectively. Further, the table lists the resources of a full-featured hardware IOMMU [34] . This IOMMU features a 64-entry TLB and has been implemented on a XC6VLX760 device. As expected, the total resource utilization of our solution is much lower than that of the full-featured IOMMU. It is worth noting that the CAM of the IOMMU has a maximum access latency of 6 cycles which allows for an implementation using Block RAM (BRAM) instead of LUT and register slices. In contrast, the CAM of the RAB has a look-up latency of 1 cycle.
Shared Virtual Memory Cost
Moreover, the RAB allows for mappings of arbitrary size, independent of the page size. Constraining the RAB to page sized mappings only, and relaxing the access latency of the CAM would allow to further reduce hardware cost. Finally, the IOMMU supports only one outstanding IOTLB miss. If another miss happens while the PTW is busy, any traffic to shared memory is blocked. In contrast, the RAB simply enqueues the missing address to the software routine on the host and continues to service requests from other unblocked cores and DMA engines.
We used a synthetic benchmark application to profile the primitives of our solution for shared virtual memory (SVM). The host allocates a large array and passes a pointer to this array to the PMCA using the mailbox. The PEs of the PMCA then issue read and write accesses to this array. To measure the tryx() hit and miss times, the application uses the PEs' internal performance counters and the clusters' peripheral timers, respectively. The latter are also accessible from outside the cluster and can be used by the kernel-level driver running on the host to profile the RAB miss handling routine. On average, the handling of a RAB miss (n c;miss ) takes 5,500 PMCA clock cycles. This includes the time required by the event unit to put the PE to sleep and to wake it up which is 5 and at most 3 clock cycles, respectively. Roughly 20 percent of n c;miss it takes until the host handles the interrupt. Another 50 percent it takes until the worker thread that actually handles the miss is scheduled in process context. Only 30 percent of the time is spent to walk the page table using get_user_pages() and to update the configuration of the RAB. Running the miss handling routine in process context is required because get_user_ pages() requires to acquire the semaphore protecting the operating system's memory management structure of the userspace process and therefore may sleep. Implementing a custom routine to walk the page table and lock the user-space pages could speed up the RAB miss handling routine by roughly 70 percent. However, this would be at the expense of portability. Our RAB miss handler using Linux kernel APIs only is fully independent of the host hardware. Instead, we optimized our solution for a fast common case, i.e., the tryx() operations hitting in the RAB. The overhead n c;tryx of the tryx() is 8 clock cycles. This is the time a PE requires to read the TRYX control register and to decide whether to go to sleep in case of a miss or to continue program execution.
How the proposed solution compares to alternative shared memory designs is visualized in Fig. 5 . Ideally, the host performs its computation 1:1 and simply passes virtual address pointers to the PMCA 2:1 . With an optimal virtual memory subsystem, the PMCA can access the shared data directly from main memory through the virtual address pointers, and execute the highly parallel application parts without SVM-related overheads 3:1 . Such an ideal SVM subsystem would be a zero-latency IOTLB that always contains the required mapping. In contrast, real SVM solutions such as the IOMMUs found in high-end heterogeneous SoCs [9] , [10] , [11] , [12] are not ideal. Real IOTLBs can have look-up latencies of multiple cycles [34] and handling misses introduces additional latency. For example, letting a hardware IOMMU operating on an optimized, dedicated I/O page table created by a driver module at offload time handling an IOTLB miss takes around 1,500 host or 350 PMCA cycles [32] . In case too many misses are outstanding, any The experiments presented in the following sections aim at quantifying this PMCA execution time overhead.
Without SVM, the host must copy shared data between virtual and the physically contiguous, un-cached memory accessible by the PMCA, and adjust any virtual address pointers in the shared data. This causes substantial overheads at offload time as shown in Fig. 5c 2:2 . Copying a single 4 KiB memory page of data to physically contiguous memory without even modifying pointers takes at least 10,200 PMCA clock cycles. Copying back the page once the PMCA is done takes another 20,500 cycles. Even in embedded systems, the data shared between host and PMCA quickly exceeds the size of a single page. For example, already a low-resolution VGA image in RGB format occupies 225 4-KiB memory pages. Doing data copies quickly becomes a bottleneck in terms of energy and performance. Compared to the copy-based approach typically used by low-power embedded systems, our lightweight SVM solution requires on average 5,500 cycles to handle a RAB miss and is thus at least 5.6x faster.
An alternative design using the contiguous memory allocator (CMA) [25] is visualized in Fig. 5d . A kernel-level driver requests virtual memory from a large, physically contiguous section allocated at boot time and exposes it to the user-space application through mmap(). The host simply passes the virtual addresses of the shared data allocated in the CMA region to the PMCA 2:1 . The PMCA directly accesses the shared data from main memory through these pointers without overhead 3:1 . The virtual-to-physical address translation simplifies to applying a constant offset. However, there are considerable drawbacks associated with this technique besides the reduced programmability. First, there is no guarantee that the CMA region can be made available at runtime [29] . Second, CMA returns un-cached memory on ARM systems which can imply substantial host execution time overheads 1:2 . Depending on the amount of processing done by the host, relying on the copy-based approach may be faster.
Pointer-Chasing Applications
The three real-life applications studied operate on pointer-rich data structures and exhibit irregular memory access patterns that are data dependent, not known in advance and thus not amenable to DMA tiling. Offering high degrees of parallelism and possibly little computation only, they can be communication intensive. Moreover, they represent typical example workloads of the "big data" application space, including graph processing, that is prominent in data centers, and where heterogeneous architectures with configurable accelerator infrastructure such as IBM's CAPI [33] and Microsoft's Project Catapult [48] are gaining traction because of their energy efficiency. In contrast, stream-oriented, data-parallel application kernels traditionally offloaded to GPGPUs can be more memory bound but feature highly regular and predictable access patterns known at offload time. Using softwaremanaged virtual memory sharing together with an IOTLB double-buffering scheme, such kernels-even with very low operational intensities-can be efficiently offloaded to PMCAs [44] . Below, we first give a brief description of each application. 2 Then, we describe the experimental results.
Application Descriptions
PageRank (PR). This algorithm was originally used by Google to rank web sites [7] . Every web site is represented by a vertex, and a link from one site to another is represented by an arc between the two corresponding vertices. The weight or rank of all vertices is initialized to 1=n vertices . The algorithm then iteratively processes the graph. The rank of every vertex is divided by the number of successor vertices and added to their rank. At the end of every iteration, the rank of dangling vertices is equally distributed to all vertices, and all ranks are normalized. The procedure is repeated until the ranks converge. PR is an ideal candidate to use an adjacency list to represent the highly irregular graph, which is not altered during processing. Since the number of computations performed in every vertex is very low, basically a single division and one addition per successor vertex, PR is highly communication intensive. Furthermore, it features low locality of reference and therefore represents a worstcase scenario. Parallelization of PR using OpenMP is achieved on a vertex level. Random Hough Forests (RHFs). The second benchmark application, from the machine learning domain, is the classification stage of an object detector using random Hough forests [49] , i.e., a set of binary decision trees. To detect the bounding boxes of instances of a class in an image, the application computes the corresponding Hough image using RHFs. Once all patches have been classified, a Gaussian filter is applied to the Hough image. The detection hypotheses are found at the maxima locations, and the values at these locations serve as confidence measures.
Face Detection (FD). The third application, also from the machine learning domain, uses the well-known Viola-Jones algorithm [50] . To detect a face in a particular location in an input image, the corresponding image patch is fed to a degenerate decision tree, a so-called cascade. Per node, one or multiple weak classifiers (weaks) are computed. Every weak specifies a simple test to perform on the patch and a threshold. If the weighted sum of the outputs of the weaks is below a node-specific threshold, it is very likely that the patch does not contain a face and the detection is aborted. Otherwise, the patch is fed to the next node, where the same procedure is repeated with different weaks. The cascades are designed such as to reject negative patches with as little computation, i.e., as early as possible in the cascade. The individual classifiers are trained to have a detection rate close to 100 percent, while the false positive rate can be fairly high. The overall high detection accuracy is reached by cascading multiple weaks and cascades.
Measurement Results
The three benchmarks were profiled with different sets of input data. Table 3 lists the details of the different configurations. It gives the size of the pointer-rich data structures in shared memory, the L1 data cache miss rate (measured when executing the graph processing stage on the host), the communication-to-computation ratio (CCR), i.e., the ratio of load/store instructions to the total number of instructions of the offloaded application part, and the RAB miss rate. Fig. 6 shows the execution time of the tested configurations normalized to the execution time achievable with an ideal virtual memory solution, i.e., a system without tryx() overhead and an IOTLB that always contains the required mapping. This can be achieved by copying the pointer-rich data structures to a physically contiguous, un-cached part of the shared memory at offload time, adjusting any pointers in the data structures, and by disabling the compiler extensions inserting the tryx(). As visualized in Fig. 5c , this implies considerable overheads on the host side. For example for PR, we measured execution time and code size overheads of 78 and 90 percent, respectively. Fig. 6 also shows the performance when using a software cache [38] . Besides avoiding the need for hand-optimized DMA programming, this reduces the pressure on the virtual memory system. Only misses in the software cache need address translation and thus protection with tryx() operations. The software cache has an access latency n c;ca of 8 cycles and the cache miss rate was modeled as m ca ¼ 10 percent for all configurations.
PageRank. Even for the very memory-intensive PR, the slowdown due to the tryx() is below 15 percent (blue bars in Fig. 6 ). Considering also misses (red bars), even a marginal increase of the RAB miss rate m TLB leads to an additional slowdown of 65 percent (PR 10 k vertices). The reason for this high impact is twofold. First, PR is memory bound (CCR ¼ 31 percent). Even a very low miss ratio means many misses. Second, PR has low locality of reference (m ca ¼ 8:4 percent). Moreover, the graph with 10k vertices is not sufficiently small to be remapped with the available 32 RAB slices. Since the RAB slices are configured on demand, the very first access to every memory page of the pointer-rich data structure causes a compulsory miss in the RAB. Once all 32 RAB slices are in use, every RAB miss causes the oldest remapping to be replaced by the missing remapping. The next access to the replaced remapping then causes a capacity miss. Due to the low locality of reference of PR, the percentage of capacity misses is very high (% 95 percent) and handling these dominates the overall execution time as indicated in Fig. 6 . Using an LRU replacement strategy helps to reduce the number of RAB misses by up to 11 percent which translates into a speedup of up to 4.5 percent (purple bars). A more effective technique to decrease the impact of RAB misses is the software cache (33 percent speedup). It reduces the pressure on the virtual memory system such that even a very simple FIFO replacement strategy and a non-optimized, but portable miss handler allow for adequate performance (green bars).
Random Hough Forests. The impact of both the misses and the tryx() is much less pronounced, despite of relatively high RAB miss rates. There are two main reasons for the lower susceptibility.First,theoffloadedkernelfeaturesmorecomputations (lower CCR) compared to PR, which allows to amortize the cost of the tryx() and the misses. Second, the last step of the feature extraction operates on data previously copied to the L1 SPM using DMA. The feature extraction does not need to access the shared main memory and needs no protection. Moreover, the kernel has a higher overall locality of reference which is also indicated by the relatively low host cache miss rate m ca of 1.5 percent. Performing LRU replacement reduces the number of RAB misses by at most 2 percent. The performance increases by less than 1 percent. The reason for this minor speedup are the large trees and the low temporal locality of the classification phase. Between two accesses to the same tree node, many other nodes are accessed, which likely causes all mappings to be replaced in the RAB. What can help here is least frequently used (LFU) replacement, explicit locking of RAB entries remapping most frequently accessed parts of the trees such as the root nodes, or the use of a software cache.
Face Detection. FD combines a fairly high CCR with reasonably good locality of reference (m ca 1:6 percent). Consequently, the cost of the tryx() cannot be amortized as well as for RHFs but it is not as high as for PR. This results in a slowdown of 6 percent (blue bars). Since the cascades are sufficiently small to be remapped with the available RAB slices, only compulsory misses occur and the RAB miss rate m TLB % 0:005 percent is substantially lower compared to RHFs. Neither the LRU replacement strategy nor the software cache can significantly increase performance.
Synthetic Model
The cost of the primitives presented in Section 5.2 can be used to predict the execution time overhead of our solution compared to an ideal virtual memory solution for arbitrary applications. Assuming that any computation instruction takes 1 clock cycle, the total execution time in cycles n c;exe of an application with n inst instructions and a given communication-to-computation ratio (CCR) is
with n c;mem being the average number of cycles per memory access. Ideally, the PMCA uses double-buffered DMA transfers between shared memory and its internal L1 SPMs to hide the access latency to shared memory. The PEs' memory operations then all go to the cluster internal L1 SPM and take a single cycle only. The programming effort required to implement such a shared memory access scheme can be very high. To ease programming and still let the PEs access the data from the fast L1 SPMs, a software cache [38] can be used. Given the number of cycles n c;ca to access a word in the software cache, the miss rate m ca of this software cache, the number of cycles n c;LLC to access a word in the shared lastlevel cache (LLC), the overhead n c;tryx of the tryx(), the miss rate of the TLB m TLB , and the number of cycles n c;miss to handle a TLB miss, the number of cycles per memory access can be given as n c;mem ¼n c;ca þ m ca ðn c;LLC þ n c;tryx
In the following, we focus on the analysis of the overhead of the tryx() and therefore assume m TLB ¼ 0. The slowdown s compared to an ideal shared virtual memory system can then be formulated using Eqs. (1) and (2):
Fig . 7 shows the slowdown s due to the tryx() operations computed by this formula for different cache miss rates m ca and CCRs. The overhead n c;tryx of the tryx() is 8 clock cycles in case of a RAB hit, and n c;LLC ¼ 15. We selected a software cache access time n c;ca of 8 cycles (¼ n c;tryx ). As one can see, for applications offering a suitable amount of computations (CCR < 30 percent), the slowdown is below 22 percent even for relatively high data cache miss rates of 15 percent. The synthetic model is also in line with the results measured for the heterogeneous applications shown in Fig. 6 . For example, the measured slowdown for PR (m ca 10 %; CCR % 30 %) is below 15 percent (blue bars) while the slowdown predicted by the model is below 17 percent. For FD (m ca % 1:5 %; CCR % 30 %), the measured and the predicted slowdown is 6 and 7 percent, respectively. Finally, this model can be applied to benchmark applications. The MiBench [45] embedded benchmark suite contains several, relatively small kernels from different application domains such as automotive, consumer, network, office, security and telecommunications, which are good candidates to be accelerated by a PMCA. We compiled and ran a majority of the benchmarks on the ARM host. The CCR can be extracted from the instruction distributions found in [45] . It is between 11 and 52 percent for bitcount and typeset, respectively (33 percent on average). For the miss rate m ca of the software cache in the model, the actual L1 data cache miss rate of the applications can be used. The miss rates were measured using a custom kernel module accessing the performance monitor units of the ARM. They vary between 0.2 percent (PGP) and 2.2 percent (lame, typeset). The slowdown for the various applications is shown in Fig. 8 . It can be seen that for many applications, the expected slowdown is below 2 percent. A notable slowdown can only be seen for applications featuring both a CCR above 40 percent and a cache miss rate greater than 2 percent. If the CCR is reasonably low, i.e., if the application offers enough computations to amortize the cost of the tryx(), or if the cache can effectively filter the traffic to the shared memory, the slowdown of our lightweight virtual memory solution is negligible. 
CONCLUSION AND FUTURE WORK
In this work, we have presented a lightweight, mixed hardware/software solution enabling shared virtual memory support for many-core accelerators in embedded heterogeneous SoCs. On the hardware side, it uses a simple Remapping Address Block (RAB) as an input/ output translation lookaside buffer (IOTLB), to translate virtual addresses (as seen by the application running on the accelerator) to their physical counterpart in main memory. The IOTLB itself is completely softwaremanaged by a kernel-level driver module running on the host. To support zero-copy offloading of kernels relying on pointer-rich data structures, for which the accelerator's access pattern to shared memory is data dependent and not known in advance, our solution uses a compiler extension to automatically protect the accelerator's accesses to shared data elements with calls to lowoverhead tryx() functions. These functions validate the response of the shared memory accesses using a lowlatency access, special-purpose register. In case of a RAB miss, the calling processing element is put into sleep and woken up after the miss has been handled by the host. The proposed solution is of low hardware complexity and non-intrusive to the architecture of the accelerator cores and hence suitable to be implemented in various accelerators without inherent hardware support for virtual memory. If the offloaded kernels offer sufficient arithmetic computation, the cost of our solution can be amortized and the overhead compared to an ideal virtual memory solution is negligible.
We are currently extending our study in two main directions. On one hand, we are working towards reducing the impact of misses on the system's performance. This includes the design and implementation of a custom miss-handling routine running either on the host or a dedicated microcontroller core inside the RAB to speed up the the miss handling, the implementation of a more advanced replacement strategy as well as the integration of a second IOTLB level to reduce the number of capacity misses at reasonable hardware cost. On the other hand, we are extending our evaluation platform to study the integration of larger accelerator configurations including more PEs and multiple clusters into a more recent ARMv8-based host system. Pirmin Vogel received the MSc degree in electrical engineering and information technology from ETH Zurich, Switzerland, in 2013. Since then, he is working toward the PhD degree at the Integrated Systems Laboratory (IIS), ETH Zurich. His research interests include digital signal processing, heterogeneous computing architectures and embedded systems-on-chip with a focus on operating system, driver, runtime and programming model support for efficient and transparent accelerator programming. He is a student member of the IEEE.
Andrea Marongiu received the MSc degree in electronic engineering from the University of Cagliari, Italy, in 2006 and the PhD degree in electronic engineering from the University of Bologna, Italy, in 2010. He currently is a postdoc researcher at ETH, Zurich. He also holds a postdoc position at the University of Bologna. His research interests concern parallel programming model and architecture design in the single-chip multiprocessors domain, with special emphasis on compilation for heterogeneous architectures, efficient usage of on-chip memory hierarchies and SoC virtualization. He has published more than 50 papers in peer reviewed international journals and conferences. He is a member of the IEEE.
Luca Benini is the chair of digital circuits and systems at ETH Zurich and a full professor with the University of Bologna. He has served as chief architect for the Platform2012/STHORM project in STmicroelectronics, Grenoble. He has held visiting and consulting researcher positions at EPFL, IMEC, Hewlett-Packard Laboratories, Stanford University. His research interests include in energy-efficient system design and Multi-Core SoC design. He is also active in the area of energy-efficient smart sensors and sensor networks for biomedical and ambient intelligence applications. He has published more than 700 papers in peer-reviewed international journals and conferences, four books and several book chapters. He is a fellow of the IEEE and the ACM, and a member of the Academia Europaea.
" For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
