Energy efficiency and computing flexibility are some of the primary design constraints of heterogeneous computing. In this paper, we present FlashAbacus, a data-processing accelerator that self-governs heterogeneous kernel executions and data storage accesses by integrating many flash modules in lightweight multiprocessors. The proposed accelerator can simultaneously process data from different applications with diverse types of operational functions, and it allows multiple kernels to directly access flash without the assistance of a host-level file system or an I/O runtime library. We prototype FlashAbacus on a multicore-based PCIe platform that connects to FPGA-based flash controllers with a 20 nm node process. The evaluation results show that FlashAbacus can improve the bandwidth of data processing by 127%, while reducing energy consumption by 78.4%, as compared to a conventional method of heterogeneous computing.
hardware threads, in turn yielding performance many orders of magnitude higher than that of CPUs [5, 21, 47].
While these accelerators are widely used in diverse heterogeneous computing domains, their power requirements render hardware acceleration difficult to achieve in low-power systems such as mobile devices, automated driving systems, and embedded designs. For example, most modern GPGPUs and MICs consume as much as 180 watts and 300 watts per device [1, 14], respectively, whereas the power budgets of most embedded systems are only a few watts [8]. To satisfy this low-power requirement, ASICs and FPGAs can be used for energy-efficient heterogeneous computing. However, ASIC- or FPGA-based accelerators are narrowly optimized for specific computing applications. In addition, their system usability and accessibility are often limited due to static logic and inflexible programming tasks. Alternatively, embedded multicore processors such as the general purpose digital signal processor (GPDSP), embedded GPU, and mobile multiprocessor can be employed for parallel data processing in diverse low-power systems [25, 35].
Although the design of low-power accelerators with embedded processors opens the door to a new class of heterogeneous computing, their applications cannot be sufficiently well tuned to enjoy the full benefits of the data-processing acceleration; this is because of two root causes: i) low processor utilization and ii) file-associated storage accesses. Owing to data transfers and dependencies in heterogeneous computing, some pieces of code execution are serialized, in turn limiting the processing capability of the low-power accelerators. In addition, because, to the best of our knowledge, all types of accelerators have to access external storage when their internal memory cannot accommodate the target data, they waste tremendous energy for storage accesses and data transfers. External storage accesses not only introduce multiple memory copy operations in moving the data across physical and logical boundaries in heterogeneous computing, but also impose serial computations for marshalling the data objects that exist across three different types of components (e.g., storage, accelerators, and CPUs). Specifically, 49% of the total execution time and 85% of the total system energy for heterogeneous computing are consumed for only data transfers between a low-power accelerator and storage.
In this paper, we propose FlashAbacus, a data processing accelerator that self-governs heterogeneous computing and data storage by integrating many flash modules in lightweight multiprocessors to resemble a single low-power data processing unit. FlashAbacus employs tens of low-power functional units rather than the hundreds or thousands of hardware threads that most manycore accelerators employ. Even though the computing capacity of this small number of functional units is not as high as that of the manycore accelerators, FlashAbacus can process data near flash with different types of applications, with kernels that contain multiple functions, or both. Such multi-kernel execution can maximize resource utilization by dynamically scheduling multiple applications across all the internal processors. The kernel scheduler can also reorder the executions in an out-of-order manner by recognizing the code blocks that have no data dependency across different kernels of the target data-intensive application. Therefore, FlashAbacus offers high bandwidth with a relatively small number of processors, thereby maximizing energy efficiency. A challenge in enabling such multi-kernel execution near flash is that the flash media integrated into the accelerator are not practically working memories; they operate in a manner similar to mass storage. That is, the cores of the accelerator cannot execute any type of kernel on flash-resident data through typical load and store instructions. The accelerator would therefore need to employ an OS, but an OS would make the accelerator bulkier and worsen its performance characteristics. To address these challenges, we introduce Flashvisor, which allows the cores of the accelerator to access flash directly without any modification of the instruction set architecture or assistance from the host-side storage stack. Flashvisor virtualizes storage resources by mapping the data section of each kernel to physical flash memory.
In addition, it protects flash from the simultaneous accesses of multi-kernel execution by employing a range lock, which requires few memory resources. We also separate the flash management tasks from their address translations [15, 23, 42, 56], and allocate a different processor to take over the flash management tasks. This separation makes the flash management mechanisms (e.g., garbage collection) invisible to the execution of multiple kernels.
We built a prototype FlashAbacus on a low-power multicore [29] based PCIe platform that connects FPGA-based flash controllers with a 20 nm node process [78] . To evaluate the effectiveness of FlashAbacus prototype, we converted diverse computing applications [57] into our new execution framework. Our evaluation results show that FlashAbacus can offer 127% higher bandwidth and consume 78% less energy than a conventional hardware acceleration approach under data-intensive workloads.
PRELIMINARIES
In this section, we analyze the overheads of moving data between an accelerator and storage, which are often observed in a conventional heterogeneous computing system. This section then explains the baseline architecture of our FlashAbacus.
Physical and Logical Data Paths
Hardware. Figure 1(a) shows the physical datapath with a CPU, a low-power accelerator, and an SSD in conventional heterogeneous computing. In cases where the accelerator needs to process a large amount of data, the CPU generates I/O requests and issues them to the underlying SSD through I/O controllers such as AHCI [30, 34]. An SSD controller then transfers the data from flash to its internal DRAM, and the host controller again moves such data from the internal DRAM to the host-side DRAM through a storage interface [9, 37, 76] ( 1 ). During this time, the data can be reconstructed and set in order as a form of objects that the accelerator can recognize ( 2 ). Finally, the CPU transfers the data from the host-side DRAM to the internal DRAM of the accelerator through the PCIe interface again ( 3 ). Note that, at this juncture, all kernel executions of the accelerator are still stalled because the input data are still being transferred and are not ready to be processed. Once the data are successfully downloaded, the embedded multicore processors (EMPs) of the accelerator can start processing the data, and the results will be delivered to the SSD in the inverse order of the input data loading procedure ( 3 → 2 → 1 ). The movement of data across different physical interface boundaries imposes a long latency before the accelerator begins to actually process data, and it wastes energy by creating redundant memory copies. In addition, the physical datapath deteriorates the degree of parallelism for kernel executions. For example, a single application task has to be split into multiple kernels due to the capacity limit of the internal DRAM of the accelerator, in turn requiring file-associated storage accesses and thereby serializing the execution. Software. Another critical bottleneck for heterogeneous computing is the discrete software stacks, one for the accelerator and one for the SSD.
As shown in Figure 1, the host needs to employ a device driver and a runtime library for the accelerator, while it requires a storage stack that consists of flash firmware, a host block adapter (HBA), a file system, and an I/O runtime to employ the underlying device as storage. The runtime libraries for the accelerator and the SSD offer different sets of interfaces, which allow a user application to service files or offload data processing, as appropriate. In contrast, the accelerator driver and HBA driver are involved in transferring data between the device-side DRAM and host-side DRAM. Therefore, the user application first needs to request data from the underlying SSD through the I/O runtime library ( 1 ), and then it must write the data to the accelerator through the accelerator runtime library ( 3 ). This activity causes multiple data copies within the host-side DRAM. Furthermore, when the file system and accelerator driver receive data from the application, all the data from user buffers must be copied to OS-kernel buffers, which again creates extra memory copies within the host-side DRAM ( 2 ). This problem arises because OS-kernel modules cannot directly access the user memory space, as there is no guarantee that the current OS-kernel module is executing in the process in which the I/O request was initiated. In addition to these unnecessary data copies (within the host-side DRAM), the discrete software stacks also increase data movement latency and consume energy because they enforce many user/privilege mode switches between their runtime libraries and OS-kernel drivers.
Baseline Architecture
The datapath analysis in the previous section shows that the redundant memory copies across different devices are unnecessary if we can integrate the SSD directly into the accelerator, as shown in Figure 2a. The accelerator can also execute a series of tasks without interruption if there is enough memory space to accommodate the input data. Motivated by these observations, we built an FPGA-based flash backbone with a 20 nm process node [78] and tightly integrated it into a commercially available embedded SoC platform [29], as shown in Figure 2b. Multicore. To perform energy-efficient and low-power data processing near flash, our hardware platform employs multiple lightweight processors (LWPs) [29], which are built on a VLIW architecture. The VLIW-based LWPs require neither out-of-order scheduling [19, 44] nor a runtime dependency check because these dynamics are shifted to compilers [18, 45]. Each LWP has eight functional units (FUs) that consist of two multiplication FUs, four general purpose processing FUs, and two load/store FUs; thus, we can reduce the hardware complexity of the accelerator, while simultaneously satisfying the diverse demands of low-power data processing applications. In this design, LWPs are all connected over a high crossbar network, and they can communicate with each other over message queue interfaces that we implemented on top of a hardware queue attached to the network [71]. Memory system. Our hardware platform employs two different memory systems over the network that connects all LWPs: i) DDR3L and ii) scratchpad. While DDR3L consists of a large low-power DRAM [55], the scratchpad is composed of eight high-speed SRAM banks [27]. In our platform, DDR3L is used for mapping the data sections of each kernel to flash memory, thereby hiding the long latency imposed by flash accesses.
DDR3L can also aggregate multiple I/O requests headed to the underlying storage modules and buffer the majority of flash writes, thereby taking over the role of the traditional SSD internal cache [74]. In contrast, the scratchpad serves all administrative I/O requests issued for flash virtualization, as well as the entries queued by the communication interfaces, as fast as an L2 cache. Note that all LWPs share a single memory address space, but have their own private L1 and L2 caches. Network organization. Our hardware platform uses a partial crossbar switch [62] that separates a large network into two sets of crossbar configurations: i) a streaming crossbar (tier-1) and ii) multiple simplified crossbars (tier-2). The tier-1 crossbar is designed for high performance (thereby integrating multiple LWPs with memory modules), whereas the throughput of the tier-2 network is sufficient for the performance that the Advanced Mezzanine Card (AMC) and PCIe interfaces exhibit [28]. These two crossbars are connected over multiple network switches [70]. Flash backbone. The tier-2 network's AMC is also connected to the FPGA Mezzanine Card (FMC) of the backend storage complex through four Serial RapidIO (SRIO) lanes (5 Gbps/lane). In this work, the backend storage complex is referred to as the flash backbone, which has four flash channels, each employing four flash packages over NV-DDR2 [58]. We introduce an FPGA-based flash controller for each channel, which converts the I/O requests from the processor network into the flash clock domain. To this end, our flash controller implements inbound and outbound "tag" queues, each of which is used for buffering the requests with minimum overhead. During this domain transition, the controllers can handle flash transactions and transfer the corresponding data from the network to flash media through the SRIO lanes, which can minimize the role of flash firmware.
It should be noted that the backend storage is a self-contained module, separated from the computation complex via FMC in our baseline architecture. Since the flash backbone is designed as a separable component, one can simply replace worn-out flash packages with new flash (if needed). Prototype specification. The eight LWPs of our FlashAbacus operate with a 1 GHz clock, and each LWP has its own private 64KB L1 cache and 512KB L2 cache. The size of the 8-bank scratchpad is 4MB; DDR3L also consists of eight banks with a total size of 1GB. On the other hand, the flash backbone connects 16 triple-level cell (TLC) flash packages [48], each of which has two flash dies therein (32GB). Each group of four flash packages is connected to one of the four NV-DDR2 channels that work on ONFi 3.0 [58]. The 8KB page read and write latencies are around 81 us and 2.6 ms, respectively. Lastly, the flash controllers are built on the 2 million system logic cells of a Virtex Ultrascale FPGA [78]. The important characteristics of our prototype are presented in Table 1.
Figure 3a illustrates a kernel execution model for conventional hardware acceleration. In the prologue, a data processing application needs to open a file and allocate memory resources for both an SSD and an accelerator. Its body iterates over the code segments that read a part of the file, transfer it to the accelerator, execute a kernel, get results from the accelerator, and write them back to the SSD. Once the execution of the body loop is completed, the application concludes by releasing all the file and memory resources. This model operates well with traditional manycore-based high-performance accelerators that have thousands of hardware threads. However, in contrast to such traditional accelerators, the memory space of low-power accelerators is limited and can hardly accommodate all the data sets that an application needs to process.
Thus, low-power accelerators demand more iterations of data transfers, which in turn can significantly increase I/O stack overheads. Furthermore, a small memory size of the low-power accelerators can enforce a single data-processing task split into multiple functions and kernels, which can only be executed by the target accelerator in a serial order.
HIGH-LEVEL VIEW OF SELF-GOVERNING

Challenge Analysis
For better insights on the aforementioned challenges, we built a heterogeneous system that employs the low-power accelerator described in Section 2.2 and an NVMe SSD [32] as external storage instead of our flash backbone. Utilization. Figures 3b and 3c show the results of our performance and utilization sensitivity studies, in which the fraction of serial parts in kernel executions enforced by data transfers was varied. As the serial parts of a program increase, the throughput of data processing significantly decreases (Figure 3b). For example, if there exists a kernel whose fraction of serialized executions is 30%, the performance degrades by 44%, on average, compared to the ideal case (e.g., 0%); thus, the accelerator becomes non-scalable and the full benefits of heterogeneous computing are not achieved. Poor CPU utilization is the reason behind this performance degradation (Figure 3c). For the previous example, processor utilization is less than 46%; even in cases where the serial parts account for only 10% of the total kernel executions, the utilization is at most 59%. Storage accesses. We also evaluate the heterogeneous system by running a collection of PolyBench [57] applications and decompose the execution time of each application into i) SSD latency to load/store input/output data, ii) CPU latency that the host storage stack takes to transfer the data, and iii) accelerator latency to process the data. Figure 3d shows that the data-intensive applications (e.g., ATAX, BICG, and MVT) consume 77% of the total execution time to transfer the data between the accelerator and SSD. Figure 3e illustrates
that the storage stack accesses, including file system operations and I/O services, consume 85% of the total energy in heterogeneous computing. Note that the data transfer overheads for computation-intensive applications (e.g., SYRK and 3MM) are not remarkable, but the corresponding energy consumed by the storage stack accounts for more than 77% of the total energy consumed by the system, on average. The detailed energy analysis of heterogeneous computing will be further discussed in Section 5.3.
To address the above two challenges, FlashAbacus governs the internal hardware resources without assistance from the host or OS. We introduce a multi-kernel execution model and a storage monitor to fully utilize the LWPs and to virtualize the flash backbone into the processors' memory address space, respectively.
Multi-Kernel Execution
Hand-threaded parallelism (using pthreads or a message passing interface) can offer finer control over parallel execution, but it requires OS thread management, which is infeasible on our low-power accelerator. In this work, we allow all the LWPs to execute different types of kernels in parallel, and each kernel can contain various operational functions. This in turn enables users to offload diverse user applications and perform different types of data processing in concert; this is referred to as multi-kernel execution. Figure 4a shows an example of the multi-kernel execution model. While a conventional kernel is a function that usually has a very simple iteration, our kernel can contain many functions and handle deep function call chains. In this execution model, a host can also offload multiple kernels, which are associated with different applications, to our accelerator (Figure 4b). While our multi-kernel execution is not as powerful as the thousands of hardware threads that most manycore accelerators offer, it allows users to perform more flexible data processing near flash and opens up opportunities to make data processing more energy efficient than in conventional accelerators.
However, executing different kernels, each with many functions across multiple LWPs, can introduce other technical challenges such as load balancing and resource contention. To address these challenges, one can simply expose all internal LWP's resources to the host so that users can finely control everything on their own. Unfortunately, this design choice can lead to a serious security problem, as an unauthorized user can access the internal resources and put them to an improper use. This approach may also introduce another type of data movement overheads as frequent communications are required to use diverse FlashAbacus resources from outside. Therefore, our accelerator internally governs all the kernels based on two different scheduling models: i) inter-kernel execution and ii) intra-kernel execution. In general, in inter-kernel executions, each LWP is dedicated to execute a specific kernel that performs data processing from the beginning to the end as a single instruction stream. In contrast, the intra-kernel execution splits a kernel into multiple code blocks and concurrently executes them across multiple LWPs based on the input data layout. The scheduling details will be explained in Section 4.
Fusing Flash into a Multicore System
The lack of file and runtime systems introduces several technical challenges to multi-kernel execution, including memory space management, I/O management, and resource protection. An easy-to-implement mechanism to address such issues is to read and write data on flash through a set of customized interfaces that the flash firmware may offer; this is the mechanism typically adopted in most active SSD approaches [33, 59]. Unfortunately, this approach is inadequate for our low-power accelerator platform. Specifically, as the instruction streams (kernels) are independent of each other, they cannot be dynamically linked with flash firmware interfaces. Furthermore, for the active SSD approaches, all existing user applications must be modified to account for the flash interfaces, leading to an inflexible execution model.
Instead of allowing multiple kernels to access the flash firmware directly through a set of static firmware interfaces, we allocate an LWP to govern the memory space of the data section of each LWP by considering flash address spaces. This component, referred to as Flashvisor, manages the logical and physical address spaces of the flash backbone by grouping multiple physical pages across different dies and channels, and it maps the logical addresses to the memory of the data section. Note that all the mapping information is stored in the scratchpad, while the data associated with each kernel's data sections are placed into DDR3L. In addition, Flashvisor isolates and protects the physical address space of the flash backbone from the execution of multiple kernels. Whenever the kernel loaded to a specific LWP needs to access its data section, it informs Flashvisor of the logical address space where the target data exist by passing a message to Flashvisor. Flashvisor then checks the permission of such accesses and translates them to physical flash addresses. Lastly, Flashvisor issues the requests to the underlying flash backbone, and the FPGA controllers bring the data to DDR3L. In this flash virtualization, most time-consuming tasks such as garbage collection or memory dump are periodically performed by a different LWP, which can address potential overheads brought by the flash management of Flashvisor (cf. Section 4.3). Note that, in contrast to a conventional virtual memory that requires paging over file system(s) and OS kernel memory module(s), our Flashvisor internally virtualizes the underlying flash backbone to offer a large byte-addressable storage space without any system software support. The implementation details of flash virtualization will be explained in Section 4.3.
IMPLEMENTATION DETAILS
Kernel. The kernels are represented by an executable object [68], referred to as the kernel description table. The description table, which is a variation of the executable and linkable format (ELF) [16], includes an executable that contains several types of section information such as the kernel code (.text), data section (.ddr3_arr), heap (.heap), and stack (.stack). In our implementation, all the addresses of such sections point to the L2 cache of each LWP, except for the data section, which is managed by Flashvisor. Offload. A user application can have one or more kernels, which can be offloaded from a host to a designated memory space of DDR3L through PCIe. The host can write the kernel description table associated with the target kernel to a PCIe base address register (BAR), which is mapped to DDR3L by the PCIe controller (cf. Figure 2b). Execution. After the completion of the kernel download(s), the host issues a PCIe interrupt to the PCIe controller, and then the controller internally forwards the interrupt to Flashvisor. Flashvisor puts the target LWP (which will execute the kernel) in sleep mode through the power/sleep controller (PSC) and stores the DDR3L address of the downloaded kernel in a special register of the target LWP, called the boot address register. Flashvisor then writes an inter-process interrupt register of the target LWP, forcing this LWP to jump to the address written in the boot address register. Lastly, Flashvisor pulls the target LWP out of sleep mode through the PSC. Once this revocation process is completed, the target LWP begins to load and execute the specified kernel. Thus, Flashvisor can decide the order of kernel executions within an LWP or across all LWPs.
Inter-kernel Execution
Static inter-kernel scheduling. The simplest method to execute heterogeneous kernels across multiple LWPs is to allocate each incoming kernel statically to a specific LWP based on the corresponding application number. Figure 5a shows an example that has two user applications, App0 and App2, each of which contains two kernels (e.g., k0/k1 for App0 and k2/k3 for App2). In this example, a static scheduler assigns all kernels associated with App0 and App2 to LWP0 and LWP2, respectively. Figure 5b shows the
Dynamic inter-kernel scheduling. To address the poor utilization issue behind static scheduling, Flashvisor can dynamically allocate and distribute different kernels among LWPs. When a new application arrives, this scheduler assigns the corresponding kernels to any available LWPs in a round-robin fashion. As shown in Figure 5c, k1 and k3 are allocated to LWP1 and LWP3, and they are executed in parallel with k0 and k2. Therefore, the latencies of k1 and k3 are reduced, as compared to the case of the static scheduler, by 2 and 3 time units, respectively. Since each LWP informs the completion of kernel execution to Flashvisor through the hardware queue (cf. Figure 2)
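The difference between the two inter-kernel policies can be illustrated with a minimal Python sketch. The function names and the `(kernel, application number)` tuples are hypothetical and only mirror the App0/App2 example above; the dynamic policy is simplified by assuming that every LWP is idle when the kernels arrive, so round-robin reduces to cycling over LWP indices.

```python
def static_assign(kernels):
    """Static policy: every kernel goes to the LWP whose index equals
    the application number of the kernel (hypothetical encoding)."""
    return {name: app for name, app in kernels}


def dynamic_assign(kernels, num_lwps):
    """Dynamic policy (simplified): hand kernels to LWPs round-robin,
    assuming all LWPs are available when the kernels arrive."""
    return {name: i % num_lwps for i, (name, _app) in enumerate(kernels)}


# k0/k1 belong to App0 and k2/k3 belong to App2, as in the example above.
kernels = [("k0", 0), ("k1", 0), ("k2", 2), ("k3", 2)]
static_map = static_assign(kernels)
dynamic_map = dynamic_assign(kernels, num_lwps=4)
print(static_map)   # k0 and k1 share LWP0; k2 and k3 share LWP2
print(dynamic_map)  # each kernel gets its own LWP
```

Under the static policy, LWP1 and LWP3 stay idle while k1 and k3 wait behind k0 and k2, which is the utilization problem that the dynamic policy is meant to remove.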
Intra-kernel Execution
Microblocks and screens. In FlashAbacus, a kernel is composed of multiple groups of code segments, wherein the execution of each group depends on its input/output data. We refer to such groups as microblocks. While the execution of different microblocks should be serialized, in several operations, different parts of the input vector can be processed in parallel. We call these operations (within a microblock) screens, which can be executed across different LWPs.
For example, Figure 6a shows the multiple microblocks observed in the kernel of FDTD-2D, which is based on Yee's method [57]. The goal of this kernel is to obtain the final output matrix, hz, by processing the input vector, _fict_. Specifically, in microblock 0 (m0), this kernel first converts _fict_ (1D array) to ey (2D array). The kernel then prepares new ey and ex vectors by calculating ey/hz and ex/hz differentials in microblock 1 (m1). These temporary vectors are used for getting the final output hz in microblock 2 (m2). In m2, the code executed per (inner loop) iteration generates one element of the output vector, hz, at a time. Since there are no risks of write-after-write or read-after-write in m2, we can split the outer loop of m2 into four screens and allocate them across different LWPs for parallel execution. Figure 6b shows the input and output vector splitting and their mapping to the data sections of four LWPs to execute each screen.
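As a rough illustration of how an outer loop without write-after-write or read-after-write hazards can be cut into screens, the following Python sketch splits the row range into four contiguous chunks and checks that processing the chunks independently gives the same result as the serial loop. The helper names and the per-element update are placeholders introduced here; they are not the actual FDTD-2D loop body.

```python
def split_into_screens(num_rows, num_screens=4):
    """Return contiguous [start, stop) row ranges, one per screen."""
    base, extra = divmod(num_rows, num_screens)
    bounds, start = [], 0
    for s in range(num_screens):
        stop = start + base + (1 if s < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def update_rows(src, dst, start, stop):
    """Placeholder loop body: each output element depends only on inputs,
    so rows can be processed in any order (no WAW/RAW hazards)."""
    for i in range(start, stop):
        for j in range(len(src[i])):
            dst[i][j] = 2 * src[i][j] + 1


src = [[i * 10 + j for j in range(3)] for i in range(10)]

# Serial execution of the whole outer loop.
serial = [[0] * 3 for _ in range(10)]
update_rows(src, serial, 0, 10)

# Screen-by-screen execution; each iteration could run on a different LWP.
screens = split_into_screens(10, 4)
screened = [[0] * 3 for _ in range(10)]
for start, stop in screens:
    update_rows(src, screened, start, stop)

print(screens)
print(screened == serial)
```

Contiguous row ranges are used here so that each screen touches a disjoint slice of the output, which matches the idea of mapping each slice to the data section of one LWP.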
In-order intra-kernel scheduling. This scheduler simply assigns the microblocks in serial order and simultaneously executes the screens within a microblock by distributing them across multiple LWPs. Figure 7b shows an example of an in-order intra-kernel scheduler with the same scenario as that explained for Figure 5b. As shown in Figure 7a, k0's m0 contains two screens ( 1 and 2 ), which are concurrently executed by different LWPs (LWP0 and LWP1). This scheduling can reduce k0's latency by 50% compared to the static inter-kernel scheduler. Note that this scheduling method can shorten the individual latency of each kernel by incorporating parallelism at the screen level, but it may increase the total execution time if the data size is not large enough to be partitioned into as many screens as there are available LWPs.
Out-of-order intra-kernel scheduling. This scheduler can perform an out-of-order execution of many screens associated with different microblocks and different kernels. The main insight behind this method is that data dependencies exist only among the microblocks within an application's kernel. Thus, if any LWPs are available, this scheduler borrows some screens from a different microblock, which may exist across kernel or application boundaries, and allocates the available LWPs to execute these screens. Similar to the dynamic inter-kernel scheduler, this method keeps all LWPs busy, which can maximize processor utilization, thereby achieving high throughput. Furthermore, the out-of-order execution of this scheduler can reduce the latency of each kernel. Figure 7c shows how the out-of-order scheduler can improve system performance while maximizing processor utilization. As shown in the figure, this scheduler pulls the screen 1 of k1's microblock 0 (m0) from time unit T1 and executes it at T0. This is because LWP2 and LWP3 are available even after all screens of k0's microblock 0 (m0), 1 and 2 , have been allocated. For the same reason, the scheduler pulls the screen a of k1's microblock 2 (m2) from T3 and executes it at T1. Similarly, the screen 1 of k2's microblock 1 (m1) can be executed at T1 instead of T2. Thus, it can save 2, 4 and 4 time units for k0, k1 and k2, respectively (versus the static inter-kernel scheduler). Note that, in this example, no screen is scheduled before the completion of all the screens of the previous microblock. In FlashAbacus, this rule is managed by the multi-app execution chain, which is a list that contains the data dependency information per application. Figure 8 shows the structure of the multi-app execution chain; the root contains multiple pointers, each indicating a list of nodes. Each node maintains a series of screen information per microblock, such as the LWP ID and the status of the execution.
Note that the order of such nodes indicates the data-dependency relationships among the microblocks.
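The in-order and out-of-order policies can be contrasted with a small simulation. This is only a sketch under several assumptions that are not taken from the implementation: the execution chain is reduced to a dictionary that maps each kernel to the screen counts of its microblocks in dependency order, every screen takes exactly one time unit, and the function name `schedule` is hypothetical.

```python
def schedule(chain, num_lwps, out_of_order):
    """Simulate screen scheduling, one list entry per time unit.

    chain: dict kernel -> list of screen counts per microblock, in
           dependency order (a simplified stand-in for the execution chain).
    Only the head microblock of a kernel is eligible in a time unit, so a
    microblock never starts before the previous one has fully completed.
    """
    pending = {k: list(mbs) for k, mbs in chain.items()}
    timeline = []
    while any(pending.values()):
        free = num_lwps
        slot = []
        for kernel, mbs in pending.items():
            if not mbs:
                continue
            take = min(mbs[0], free)      # screens issued this time unit
            if take:
                slot.append((kernel, take))
            mbs[0] -= take
            free -= take
            if mbs[0] == 0:
                mbs.pop(0)                # microblock done; next one is
                                          # eligible from the next time unit
            if not out_of_order or free == 0:
                break                     # in-order: one microblock per unit
        timeline.append(slot)
    return timeline


chain = {"k0": [2, 1], "k1": [2, 2], "k2": [1, 1]}
in_order = schedule(chain, num_lwps=4, out_of_order=False)
ooo = schedule(chain, num_lwps=4, out_of_order=True)
print(len(in_order), len(ooo))
print(ooo)
```

In this toy workload the out-of-order policy fills idle LWPs with screens borrowed from other kernels, so the same nine screens finish in fewer time units while the per-kernel microblock order is still respected.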
Flash Virtualization
The kernels on all LWPs can map the memory regions of DDR3L pointed to by their own data sections to the designated flash backbone addresses. As shown in Figure 9, an individual kernel can declare such a flash-mapped space for each data section (e.g., an input vector on DDR3L) by passing a queue message to Flashvisor. The queue message contains a request type (e.g., read or write), a pointer to the data section, and a word-based address of the flash backbone. Flashvisor then calculates the page group address by dividing the input flash backbone address by the number of channels. If the request type is a read, Flashvisor refers to its page mapping table entry, which contains the physical page group number. It then divides the translated group number by the total number of pages in each flash package; the quotient indicates the package index within a channel, and the remainder of the division is the target physical page number. Flashvisor creates a memory request targeting the underlying flash backbone, and then all the FPGA controllers take over such requests. On the other hand, for a write request, Flashvisor allocates a new page group number by simply increasing the page group number used in the previous write. In cases where there is no more available page group number, Flashvisor generates a request to reclaim a physical block. Note that, since the time spent looking up and updating the mapping information should not become an overhead of flash virtualization, the entire mapping table resides in the scratchpad; to cover the 32GB flash backbone with a 64KB page group (4 channels * 2 planes per die * 8KB page), it only requires 2MB for address mapping. Considering other information that Flashvisor needs to maintain, a 4MB scratchpad is sufficient for covering the entire address space.
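The sizing argument above can be checked with a few lines of Python. The constants come from the text; the 4-byte entry size is not stated there and is only what the numbers imply if the 2MB table holds one entry per page group. The helper `locate` and its `pages_per_package` parameter are hypothetical names used to illustrate the quotient/remainder step of the read path.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

CHANNELS = 4
PLANES_PER_DIE = 2
PAGE_BYTES = 8 * KIB
FLASH_BYTES = 32 * GIB
TABLE_BYTES = 2 * MIB

# One page group spans all channels and planes: 4 * 2 * 8KB = 64KB.
page_group_bytes = CHANNELS * PLANES_PER_DIE * PAGE_BYTES
# Number of page groups needed to cover the whole flash backbone.
num_page_groups = FLASH_BYTES // page_group_bytes
# Implied entry size, assuming one table entry per page group.
bytes_per_entry = TABLE_BYTES // num_page_groups


def locate(physical_group, pages_per_package):
    """Split a translated physical page group number into
    (package index within a channel, physical page number)."""
    return divmod(physical_group, pages_per_package)


print(page_group_bytes, num_page_groups, bytes_per_entry)
print(locate(1000, 256))  # 256 pages per package is an arbitrary example
```

With these numbers, the mapping table occupies half of the 4MB scratchpad, which leaves room for the other bookkeeping that Flashvisor keeps there.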
Note that, since the page table entries associated with each block are also stored in the first two pages within the target physical block of the flash backbone (practically used for metadata [81]), the persistence of the mapping information is guaranteed.

Protection and access control. While any of the multiple kernels in FlashAbacus can access the flash backbone, there is no direct data path between the FPGA controllers and the other LWPs that process the data near flash. That is, all requests related to the flash backbone must be taken and controlled by Flashvisor. An easy way for Flashvisor to protect the flash backbone is to add permission information and the owner's kernel number for each page to the page table entry. However, in contrast to the mapping table used for main memory virtualization, the entire mapping information of Flashvisor must be written to persistent storage and must be periodically updated to account for flash I/O services such as garbage collection. Thus, adding such temporary information to the mapping table increases the complexity of our virtualization system, which can degrade overall system performance and shorten the lifetime of the underlying flash. Instead, Flashvisor uses a range lock for each flash-mapped data section. This lock mechanism blocks a request to map a kernel's data section to flash if its flash address range overlaps with that of another data section, taking the request type into account. For example, if a new data section is to be mapped to flash for reads while that flash address is being used for writes by another kernel, Flashvisor will block the request. Similarly, a request to map for writes will be blocked if the target flash address overlaps with another data section that is being used for reads.
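The blocking rule of the range lock can be sketched as follows. A plain list scan stands in for an indexed structure, and the names `RangeLock` and `try_map` are hypothetical. The text only spells out read-versus-write and write-versus-read conflicts; treating two overlapping write mappings as a conflict as well is an assumption of this sketch.

```python
class RangeLock:
    """Toy range lock over flash page numbers (inclusive ranges)."""

    def __init__(self):
        self.ranges = []  # (start_page, last_page, mode, kernel_id)

    def try_map(self, start, last, mode, kernel_id):
        """Register the mapping and return True, or return False if blocked."""
        for s, e, m, _ in self.ranges:
            overlaps = start <= e and s <= last
            # Reads may share a range; any overlap involving a write blocks.
            # (write/write blocking is an assumption, not stated in the text)
            if overlaps and (mode == "write" or m == "write"):
                return False
        self.ranges.append((start, last, mode, kernel_id))
        return True


lock = RangeLock()
results = [
    lock.try_map(0, 9, "write", kernel_id=0),    # first mapping, accepted
    lock.try_map(5, 12, "read", kernel_id=1),    # overlaps a write range
    lock.try_map(10, 20, "read", kernel_id=1),   # disjoint from the write
    lock.try_map(15, 25, "read", kernel_id=2),   # read/read overlap is fine
    lock.try_map(18, 30, "write", kernel_id=3),  # overlaps ranges being read
]
```

A linear scan is enough to show the rule; an ordered structure keyed by the start page would make the overlap search logarithmic in the number of mapped sections.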
Flashvisor implements this range lock by using a red-black tree structure [65]; the start page number of the data section mapping request is leveraged as a key, and each node is augmented with the last page number of the data section and the mapping request type.

Storage management. Flashvisor can perform address translation similar to a log-structured pure page mapping technique [42], which is a key function of flash firmware [3]. To manage the underlying flash appropriately, Flashvisor also performs metadata journaling [3] and garbage collection (including wear-leveling) [10, 11, 38, 60]. However, such activities can be strong design constraints for SSDs, as they should be resilient to all types of power failures, including sudden power outages. Thus, such metadata journaling is performed in the foreground and garbage collection is invoked on demand [10]. In cases where an error is detected just before an uncorrectable error arises, Flashvisor excludes the corresponding block by remapping it to a new one. In addition, all the operations of these activities are executed in lockstep with address translation.
However, if these constraints can be relaxed to some extent, most of the overheads (of which flash firmware bears the brunt) are removed from multi-kernel executions. Thus, we assign another LWP to perform this storage management rather than data processing. While Flashvisor is mainly responsible for address translation and multi-kernel execution scheduling, this LWP, referred to as Storengine, periodically dumps the scratchpad information to the underlying flash as described in the previous subsection. In addition, Storengine reclaims physical blocks from the beginning of the flash address space to the end in the background. Most garbage collection and wear-leveling algorithms that are employed in flash firmware select victim blocks by considering the number of valid pages and the number of block erases. These approaches increase the accuracy of garbage collection and wear-leveling, but they require extra information to set such parameters and a search of the entire address translation information to identify a victim block. Rather than wasting compute cycles to examine all the information in the page valid pages from the victim block to the new block, and return the victim to a free block in idle, which is a background process similar to preemptible firmware techniques [38, 40]. Once the victim block is selected, the page mapping table associated with those two blocks is updated in both the scratchpad and flash. Note that all these activities of Storengine can be performed in parallel with the address translation of Flashvisor. Thus, locking the address ranges that Storengine generates for the snapshot of the scratchpad (journaling) or the block reclaim is necessary, but such activities are overlapped with the kernel executions and address translations (i.e., they are performed in the background).
EVALUATION
Accelerators. We built five different heterogeneous computing options. "SIMD" employs a low-power accelerator that executes multiple data-processing instances using a single-instruction multiple-data (SIMD) model implemented in OpenMP [49], and it performs I/Os through a discrete high-performance SSD (i.e., Intel NVMe 750 [32]). On the other hand, "InterSt", "InterDy", "IntraIo", and "IntraO3" employ the static inter-kernel, dynamic inter-kernel, in-order intra-kernel, and out-of-order intra-kernel schedulers of FlashAbacus, respectively. In these accelerated systems, we configure the host with a Xeon CPU [31] and 32GB of DDR4 DRAM [64].
Benchmarks. We also implemented 14 real benchmarks that stem from Polybench [57] on all our accelerated systems. To evaluate the impact of serial instructions on the multi-core platform, we rewrite part of the instructions of the tested benchmarks so that they execute serially. The configuration details and descriptions for each application are given in Table 2. In this table, we list how many microblocks exist in an application (denoted by MBLKs). Serial MBLK refers to the number of microblocks that have no screens; these microblocks are to be executed serially. The table also provides several workload analyses, such as the input data size of an instance (Input), the ratio of load/store instructions to the total number of instructions (LD/ST ratio), and the computation complexity in terms of the data volume to process per thousand instructions (B/KI). To evaluate the benefits of the different accelerated systems, we also created 14 heterogeneous workloads, each mixing six applications. All these configurations are presented on the right side of the table, where the symbol • indicates the applications included in each heterogeneous workload.

Profile methods. To analyze the execution time breakdown of our accelerated systems, we instrument timestamp annotations in both the host-side and accelerator-side application source code by using a set of real-time clock registers and interfaces. The annotations for each scheduling point, data transfer time, and execution period enable us to capture the details of CPU active cycles and accelerator execution cycles at runtime. On the other hand, we use blktrace analysis [6] to measure device-level SSD performance, removing the performance interference potentially introduced by any modules in the storage stack. We leverage the Intel Power Gadget tool to collect the host CPU energy and host DRAM energy [43].
Lastly, we estimate the energy consumption of the FlashAbacus hardware platform based on the TI internal platform power calculator, and use our in-house power analyzer to monitor the dynamic power of the SSD [34, 79].

Figure 10 demonstrates the overall throughput of all hardware accelerators that we implemented by executing both homogeneous workloads and heterogeneous workloads. Generally speaking, the proposed FlashAbacus approach (i.e., IntraO3) outperforms the conventional accelerator approach SIMD by 127%, on average, across all workloads tested.
Data Processing Throughput
Homogeneous workloads. Figure 10a shows the overall throughput evaluated with homogeneous workloads. In this evaluation, we generate 6 instances from each kernel. Based on the computation complexity (B/KI), we categorize all workloads into two groups: i) data-intensive and ii) computing-intensive. As shown in the figure, all our FlashAbacus approaches outperform SIMD for data-intensive workloads by 144%, on average. This is because, although all the accelerators have the same computing power, SIMD cannot process data before the data are brought in by the host system. However, InterSt performs worse than InterDy and IntraO3, by 53% and 52%, respectively. The reason is that InterSt is limited in spreading homogeneous workloads across multiple workers, due to its static scheduler configuration. Unlike InterSt, IntraIo enjoys the benefits of parallelism by partitioning kernels into many screens and assigning them to multiple LWPs. However, IntraIo cannot fully utilize multiple LWPs if there is a microblock that has no code segments that can be concurrently executed (i.e., a serial MBLK). Compared to IntraIo, IntraO3 successfully overcomes the limits of serial MBLKs by borrowing multiple screens from different microblocks, thereby improving the performance by 62%, on average. For homogeneous workloads, InterDy achieves the best performance, because all the kernels are equally assigned to a single LWP and are completed simultaneously. While IntraO3 can dynamically schedule microblocks to LWPs for a better load balance, IPC overheads (between Flashvisor and workers) and the scheduling latency of out-of-order execution degrade IntraO3's performance by 2%, on average, compared to InterDy.
Heterogeneous workloads. Figure 10b shows the overall system bandwidths of the different accelerated systems under heterogeneous workloads. For each execution, we generate 24 instances, four from each kernel. SIMD exhibits 3.5 MB/s system bandwidth across all heterogeneous workloads, on average, which is 98% worse than that of data-intensive workloads. This is because heterogeneous workloads contain computing-intensive kernels, which involve much longer latency in processing data than in storage accesses. Although the overheads imposed by data movement are not significant in computing-intensive workloads, the limited scheduling flexibility of SIMD degrades performance by 42% and 50%, compared to InterDy and IntraO3, respectively. In contrast to the poor performance observed in homogeneous workloads, InterSt outperforms IntraIo by 34% in workloads MX2,3,4,6,7,9 and 13. This is because InterSt can statically map 6 different kernels to 6 workers. However, since different kernels have various data processing speeds, unbalanced loads across workers degrade performance by 51%, compared to IntraIo, for workloads MX1,5,8,10,11 and 12. Unlike InterSt, InterDy monitors the status of workers and dynamically assigns a kernel to a free worker to achieve better load balance. Consequently, InterDy exhibits 177% better performance than InterSt. Compared to the best performance reported by the previous homogeneous workload evaluations, the performance degradation of InterDy in these workloads is caused by a straggler kernel, which exhibits much longer latency than other kernels. In contrast, IntraO3 shortens the latency of the straggler kernel by executing it across multiple LWPs in parallel. As a result, IntraO3 outperforms InterDy by 15%.

Figure 13: Analysis for energy decomposition (all results are normalized to SIMD).
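The load-balance argument for static versus dynamic inter-kernel scheduling can be made concrete with a toy makespan model. Everything here is an assumption for illustration: the kernel durations are invented (they are not measurements from this evaluation), the static policy is modeled as pinning every instance of a kernel to one worker, and the dynamic policy hands the next instance to whichever worker becomes free first.

```python
import heapq


def static_makespan(durations, instances):
    """Kernel i is pinned to worker i, so its instances run back to back."""
    return max(d * instances for d in durations)


def dynamic_makespan(durations, instances, workers):
    """Each instance goes to the worker that becomes free earliest."""
    jobs = [d for d in durations for _ in range(instances)]
    free_at = [0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for d in jobs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Hypothetical workload: five short kernels and one long-running kernel,
# four instances each (24 instances in total), six workers.
durations = [1, 1, 1, 1, 1, 6]
static_time = static_makespan(durations, instances=4)
dynamic_time = dynamic_makespan(durations, instances=4, workers=6)
```

In this toy setting the static mapping finishes at 24 time units because one worker runs all four long instances, whereas the dynamic policy finishes at 9. The dynamic finish time is still bounded by the length of a single long instance, which is the gap that intra-kernel scheduling targets by splitting one kernel across LWPs.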
Execution Time Analysis
Homogeneous workloads. Figure 11a analyzes the latency of the accelerated systems (with homogeneous workloads). For data-intensive workloads (i.e., ATAX, BICG, 2DCONV, and MVT), the average latency, maximum latency, and minimum latency of SIMD are 39%, 87%, and 113% longer than those of the FlashAbacus approaches, respectively. This is because SIMD consumes extra latency to transfer data through different I/O interfaces and to redundantly copy I/Os across different software stacks. InterSt and InterDy exhibit similar minimum latency. However, InterDy has 57% and 68% shorter average latency and maximum latency, respectively, compared to InterSt. This is because InterDy can dynamically schedule multiple kernels across different workers in parallel. The minimum latency of the inter-kernel schedulers (InterSt and InterDy) is 61% longer than that of the intra-kernel schedulers (IntraIo and IntraO3), because intra-kernel schedulers can shorten single kernel execution by leveraging multiple cores to execute microblocks in parallel.
Heterogeneous workloads. Figure 11b shows the same latency analysis, but with heterogeneous workloads. Since all heterogeneous workloads have more computing parts than the homogeneous ones, data movement overheads are partly hidden behind their computation. Consequently, SIMD exhibits kernel execution latency similar to that of IntraIo. Meanwhile, InterSt exhibits a shorter maximum latency for workloads MX4,5,8 and 14, compared to IntraIo. This is because these workloads contain multiple serial microblocks that stagger kernel execution. Further, InterSt has the longest average latency among all tested approaches due to its inflexible scheduling strategy. Since IntraO3 can split kernels into multiple screens and achieve a higher level of parallelism with finer-grained scheduling, it performs better than InterDy in terms of the average and maximum latency by 10% and 19%, respectively.
CDF analysis. Figures 12a and 12b depict the execution details of workloads ATAX and MX1, representing homogeneous and heterogeneous workloads, respectively. As shown in Figure 12a, InterDy takes more time to complete the first kernel compared to IntraIo and IntraO3, because InterDy allocates only a single LWP to process the first kernel. For all six kernels, IntraO3 also completes the operation earlier than or at about the same time as InterDy.
More benefits can be achieved by heterogeneous execution. As shown in Figure 12b , for the first four data-intensive kernels, SIMD exhibits much longer latency compared to FlashAbacus, due to the system overheads. While SIMD outperforms InterSt for the last two computation-intensive kernels, IntraO3 outperforms SIMD by 42%, on average.
Energy Evaluation
In Figure 13, the energy of all five tested accelerated systems is decomposed into the data movement, computation and storage access parts. Overall, IntraO3 consumes 78.4% less energy than SIMD, on average, across all the workloads that we tested. Specifically, all FlashAbacus approaches exhibit better energy efficiency than SIMD for data-intensive workloads (cf. Figure 13a). This is because the host requires periodic transfer of data between the SSD and the accelerator, consuming 71% of the total energy of SIMD. Note that InterSt consumes 28% more energy than SIMD for the computing-intensive workloads GEMM, 2MM, and SYR2K. Even though InterSt cannot fully activate all workers due to its inflexibility, it must keep Flashvisor and Storengine busy for their entire execution, leading to inefficient energy consumption behavior. However, IntraIo, InterDy, and IntraO3 reduce the overall energy consumption by 47%, 22%, and 2%, respectively, compared to InterSt. This is because IntraO3 can exploit the massive parallelism (among all

Figures 14a and 14b show an analysis of LWP utilizations obtained by executing homogeneous and heterogeneous workloads. In this evaluation, LWP utilization is calculated by dividing the LWP's actual execution time (average) by the latency to execute the entire application(s) in each workload. For data-intensive workloads, most LWP executions are stalled due to the long latency of storage accesses, which makes SIMD's LWP utilization lower than that of IntraO3 by 23%, on average. On the other hand, InterDy keeps all processors busy for 98% of the total execution time, which leads to the highest LWP utilization among all accelerated systems we tested. IntraO3 schedules microblocks in an out-of-order manner and only a small piece of a code segment is executed each time, which makes its LWP utilization slightly worse than that of InterDy (by less than 1%) for the execution of homogeneous kernels.
In contrast, for heterogeneous kernel executions, IntraO3 achieves processor utilization over 94%, on average, 15% better than that of InterDy. Since LWP utilization is strongly connected to the degree of parallelism and computing bandwidth, we can conclude that IntraO3 outperforms all the accelerated systems.
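The utilization metric used in this analysis (average per-LWP execution time divided by the latency of the whole workload) can be written as a one-line helper. The function name and the sample busy times below are hypothetical and only show how the ratio is formed.

```python
def lwp_utilization(busy_times, total_latency):
    """Average LWP execution time divided by the end-to-end workload latency."""
    average_busy = sum(busy_times) / len(busy_times)
    return average_busy / total_latency


# Hypothetical example: four LWPs, each busy for 98 of 100 time units.
example = lwp_utilization([98, 98, 98, 98], total_latency=100)
```

Because the numerator is an average over LWPs, a single idle worker lowers the metric even when the remaining workers are fully busy.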
Processor Utilizations
Dynamics in Data Processing and Power
Figures 15a and 15b illustrate the time series analysis results for the utilization of functional units and the power usage of workers, respectively. In this evaluation, we compare the results of IntraO3 with those of SIMD to understand in detail the runtime behavior of the proposed system. As shown in Figure 15a, IntraO3 achieves shorter execution latency and better function unit utilization than SIMD. This is because IntraO3 can perform parallel execution of multiple serial microblocks on multiple cores, which increases the function unit utilization. In addition, due to the shorter storage access, IntraO3 can complete the execution 3600 us earlier than SIMD. On the other hand, IntraO3 exhibits much lower power consumption than SIMD during storage access (Figure 15b). This is because SIMD requires the assistance of the host CPU and main memory in transferring data between the SSD and the accelerator, which in turn consumes 3.3x more power than IntraO3. Interestingly, the pure computation power of IntraO3 is 21% higher than that of SIMD, because IntraO3 enables more active function units to perform data processing, thus drawing more dynamic power.
Extended Evaluation on Real-world Applications
Application selection. In this section, we select five representative data-intensive workloads from graph benchmarks [12] and bigdata benchmarks [24] in order to better understand system behaviors with real applications: i) K-nearest neighbor (nn), ii) graph traversal (bfs), iii) DNA sequence searching (nw), iv) grid traversal (path) and v) mapreduce wordcount (wc).
Performance. Figure 16a illustrates the data processing throughput of FlashAbacus when executing the graph/bigdata benchmarks. One can observe from this figure that the average throughput of the FlashAbacus approaches (i.e., IntraIo, InterDy and IntraO3) exceeds that of SIMD by 2.1x, 3.4x and 3.4x, respectively, for all data-intensive workloads that we tested. Even though SIMD can fully utilize all LWPs, and is expected to deliver excellent throughput for the workloads that have no serialized fraction of microblocks (i.e., nw and path), SIMD unfortunately exhibits poorer performance than the FlashAbacus approaches. This performance degradation is observed because LWPs are frequently stalled and bear the brunt of the comparably long latency imposed by moving the target data between the accelerator and the external SSD over multiple interface and software intervention boundaries. Similarly, while SIMD's computation capability is much more powerful than that of InterSt, which must execute multiple kernels in serial order, InterSt's average throughput is 120% and 131% better than SIMD's for nw and path, respectively.

Energy. We also decompose the total energy of the system, which is used for each data-intensive application, into i) data movement, ii) computation, and iii) storage access; each item represents the host energy consumed for transferring data between the accelerator and SSD, the actual energy used by the accelerator to compute, and the energy involved in serving I/O requests, respectively. The results of this energy breakdown analysis are illustrated in Figure 16b. One can observe from this figure that InterSt, IntraIo, InterDy and IntraO3 can reduce the total average energy for processing all data by 74%, 83%, 88% and 88%, compared to SIMD, respectively.
One of the reasons behind this energy saving is that the cost of data transfers accounts for 79% of the total energy cost in SIMD, while FlashAbacus eliminates the energy wasted in moving the data between the accelerator and SSD. We observe that the computation efficiency of InterDy and IntraO3 is also better than that of SIMD, even for the workloads that have serial microblocks (i.e., bfs and nn). This is because the OpenMP runtime that SIMD uses must serialize the execution of serial microblocks, whereas Flashvisor of the proposed FlashAbacus coordinates such serial microblocks from different kernels and schedules them across different LWPs in parallel.
DISCUSSION AND RELATED WORK
Platform selection. The multicore-based PCIe platform we select [29] integrates a terabit-bandwidth crossbar network for multi-core communication, which potentially makes the platform a scale-out accelerator system (by adding more LWPs to the network). In addition, considering a low power budget (5W ∼20W), the platform on which we implemented FlashAbacus has been considered one of the most energy-efficient (and fast) accelerators for processing a massive set of data [25]. Specifically, it provides 358 Gops/s with 25.6 Gops/s/W power efficiency, while other low-power multiprocessors such as a many-core system [54], a mobile GPU [51, 52], and an FPGA-based accelerator [77] offer 10.5, 256 and 11.5 Gops/s with 2.6, 25.6 and 0.3 Gops/s/W power efficiency, respectively.

Limits of this study. While the backend storage complex of FlashAbacus is designed as a self-existent module, we believe that it is difficult to replace with conventional off-the-shelf SSD solutions. This is because our solution builds the flash firmware from scratch, customized specifically for flash virtualization. We select a high-end FPGA for the flash transaction controls for research purposes, but we believe that the actual performance of flash management cannot equal or exceed that of a microcoded ASIC-based flash controller, due to the low operating frequency of the FPGA. Another limit of this work is that we are unfortunately not able to implement FlashAbacus on different accelerators such as GPGPUs, due to access limits for such hardware platforms and a lack of support for code generation. However, we believe that the proposed self-governing design (e.g., flash integration and multi-kernel execution) can be applied to any type of hardware accelerator if vendors support it or open their IPs, and that it will outperform many accelerators that require an external storage.
Hardware/software integration. To address the overheads of data movements, several hardware approaches tried to directly shorten the long datapath that sits between different computing devices. For example, [4, 20, 61] forward the target data directly among different PCIe devices without CPU intervention. System-on-chip approaches tightly integrate CPUs and accelerators over shared memory [17, 75], which can reduce the data transfers between them. However, these hardware approaches cannot completely address the overheads imposed by discrete software stacks. Since SSDs are block devices, which can be shared by many other user applications, all SSD accesses should be controlled appropriately by the storage stack. Otherwise, the system cannot guarantee the durability of storage, and simultaneous data access without the access control of the storage stack can make data inconsistent, in turn leading to a system crash. In contrast, a few previous studies explored reducing the overheads of moving data between an accelerator and a storage device by optimizing the software stack of the two devices [63, 73, 80]. However, these studies cannot remove all limitations imposed by different physical interface boundaries, and unfortunately, their solutions highly depend on the target system's software and software environment.
In-storage processing. Prior studies [2, 13, 22, 41, 72] proposed to eliminate the data movement overheads by leveraging the existing controller to execute a kernel in storage. However, these in-storage processing approaches are limited to single-core execution, and their computation flexibility is strictly limited by the APIs built at the design stage. [33] integrates an SSD controller and a data processing engine within a single FPGA, which supports parallel data access and host task processing. While this FPGA-based approach is well optimized for a specific application, porting the current host programs to an RTL design can impose a huge burden on programmers, and the approach is still too inflexible to adopt many different types of data-intensive applications at runtime. FlashAbacus governs the storage within an accelerator and directly executes multiple general applications across multiple lightweight processors without any limit imposed by the storage firmware or APIs.
Open-channel or cross-layer optimization. Many proposals have also paid attention to cross-layer optimizations or open-channel SSD approaches that migrate firmware from the device to the kernel, or that partition some functionalities of the flash controller and implement them in the OS [7, 26, 36, 39, 46, 53]. While all these studies focused on managing the underlying flash memory and SSD efficiently, they lack consideration of putting data processing and storage together in a hardware platform. In addition, these proposals cannot be applied to a single hardware accelerator, which has no storage stack but requires byte-addressability to access the underlying storage complex.
CONCLUSION
In this work, we combined flash with lightweight multi-core accelerators to carry out energy-efficient data processing. Our proposed accelerator can offload various kernels in an executable form and process parallel data. Our accelerator can improve performance by 127%, while requiring 78% less energy, compared to a multi-core accelerator that needs external storage accesses for data-processing.
ACKNOWLEDGEMENT
This research is mainly supported by NRF 2016R1C1B2015312 and MemRay grant (2015-11-1731). This work is also supported in part by DOE DEAC02-05CH 11231, IITP-2017-2017-0-01015, and NRF2015M3C4A7065645. The authors thank Gilles Muller for shepherding this paper. We also thank MemRay Corporation and Texas Instruments for their research sample donation and technical support. Myoungsoo Jung is the corresponding author.
