ABSTRACT
The edge computing paradigm has emerged to address cloud computing issues such as scalability, security, and high response time. This new computing trend relies heavily on ubiquitous embedded systems at the edge. Performance and energy consumption are two main factors that should be considered during the design of such systems. Focusing on these two factors, this paper studies the opportunities and challenges that a heterogeneous embedded system consisting of embedded FPGAs and GPUs (as accelerators) can provide for applications. We study three challenges throughout the paper: design, modeling, and scheduling, and we propose a technique to cope with each of them. Applying the proposed techniques to three applications, namely image histogram, dense matrix-vector multiplication, and sparse matrix-vector multiplication, shows 1.79x and 2.29x improvements in performance and energy consumption, respectively, when the FPGA and GPU execute the corresponding application in parallel.
INTRODUCTION
The emergence of edge computing, which brings analytics, decision making, automation, and security tasks close to the source of data and applications, has raised new opportunities and challenges in the area of IoT and embedded systems. This new computing trend enables the execution of cloud-native tasks on resource-limited embedded systems. The versatile and dynamic behavior of these tasks has changed the traditional definition of an embedded system, which was mainly defined as a small system tuned to efficiently run a specific task inside a bigger system. Recently, Google has introduced the tensor processing unit (TPU) to efficiently run neural-network-based machine learning algorithms on the edge [8]. Amazon has announced AWS Greengrass to bring cloud computing to the edge [2].
New embedded systems demand new features such as working efficiently with the Internet, providing high computational power, consuming low energy, offering real-time responses at machinery scale with nanosecond-level latency, and working collaboratively with other similar systems to finish a shared task. Heterogeneous embedded systems are a promising approach to cope with these ever-increasing demands. Toward this end, FPGAs and GPUs, the two common accelerators, have recently been integrated into embedded systems by industry, separately, to address the new requirements. However, integrating them in a single embedded system to collaboratively execute a complex task, while fulfilling the performance, latency, predictability, and energy consumption constraints, is still a challenge. Fig. 1 shows the overview of an embedded system consisting of three processing elements (PEs): a multi-core CPU, a many-core GPU, and an FPGA. The main feature of this architecture is the direct access of the PEs to the main memory using the same address space and a shared memory controller, in contrast to current desktop platforms in which FPGAs and GPUs communicate with system memory via PCIe. This feature enables the accelerators to benefit from the zero-copy data transfer technique without the performance and energy overhead of PCIe in between, which improves memory bandwidth utilization and reduces the inter-PE communication overhead. Therefore, each PE can potentially achieve its peak performance in executing an application. However, choosing a proper PE to run a given task, with maximum performance and minimum energy consumption, is not an easy decision to make. To make this process clear, we study and compare the performance and energy consumption of the accelerators (i.e., the GPU and FPGA) running different tasks.
To this end, we need a programming model for each PE, considering the type of application. There are many academic and industrial programming models, libraries, and tools to efficiently implement different applications on embedded CPUs and GPUs. However, there is no specific design methodology for using embedded FPGAs in a system, in spite of the available high-level synthesis (HLS) tools based on the C/C++, SystemC, and OpenCL languages. This is mainly because, in FPGA-based accelerator design, designers should first provide a hardware architecture suitable for a given task and then implement the task algorithm accordingly. This process makes FPGA-based accelerator design complex, and more research is needed to find systematic approaches for addressing different types of applications.
In summary, three main challenges in designing a heterogeneous FPGA+GPU platform should be studied, which are as follows.
• Design challenge: implementing a given task on the FPGA such that it can compete with its GPU counterpart
• Modeling challenge: evaluating and predicting the performance and energy consumption of the FPGA and GPU
• Scheduling challenge: distributing a parallel task between the FPGA and GPU in order to optimize the overall performance and energy consumption
Focusing on embedded FPGAs and GPUs, this paper explains the opportunities that addressing the above challenges can bring to edge computing platforms. We first propose a systematic stream computing approach for implementing various applications on embedded FPGAs using HLS tools, and then study the opportunities and challenges that a synergy between the FPGA and GPU in an embedded system can provide for designers. We study a few applications whose collaborative execution on the heterogeneous system brings higher performance and lower energy consumption. We show that the collaboration between an embedded FPGA and GPU can bring a renaissance to the edge computing scenario.
The rest of this paper is organized as follows. The next section explains the motivations and contributions behind this paper. The previous work is reviewed in Section 3. The proposed FPGA stream computing engine is discussed in Section 4. Section 5 studies the performance and power modeling techniques. The scheduling challenge is explained in Section 6. The experimental setup is addressed in Section 7. Section 8 explains the experimental results. Finally, Section 9 concludes the paper.
MOTIVATIONS AND CONTRIBUTIONS
Taking the histogram operation, one of the common tasks in image processing, data mining, and big-data analysis, this section explains the motivations and contributions behind this paper. For this purpose, we have considered two embedded systems: the Nvidia Jetson TX1 [21] and the Xilinx Zynq MPSoC (ZCU102 evaluation board) [28]. Fig. 2 shows the block diagrams of the different parts of these systems. The Zynq MPSoC, in Fig. 2(a), mainly consists of two parts: the Processing System (PS) and the Programmable Logic (PL).
These two subsystems have direct access to the system DDR memory. The PL (i.e., the FPGA) performs its memory transactions through a few high-performance ports, including four HP ports, two HPC ports, and an ACP port. In this paper, we focus on the four HP ports that can collaboratively transfer data between the FPGA and memory, utilizing all the memory bandwidth available to the FPGA. The Nvidia Jetson TX1, shown in Fig. 2(b), is a system-on-module (SoM) combining the Nvidia Tegra X1 SoC with 4GB of LPDDR4 memory and some other modules [21]. The Nvidia Tegra X1 SoC consists of a Maxwell GPU with 256 CUDA cores, 1.6GHz, 128K L2 cache, and a 4-channel x 16-bit interface to access the system memory.
Two efficient implementations of the histogram are provided for the two embedded systems. The CUDA language is used for the GPU implementation, in which the NVIDIA Performance Primitives (NPP) library [20] is used. In addition, the C++ language and the Xilinx SDSoC toolset are used for the FPGA implementation, which is based on a streaming pipelined computing approach similar to [11]. This implementation reads data from the system memory and modifies the histogram bins in each clock cycle. Fig. 3 shows the execution time of the histogram operator running on the two embedded systems considering two separate images, denoted by image1 and image2, with different sizes (512 × 512, 1024 × 1024, 2048 × 2048, and 8192 × 8192). Whereas image1 is based on a real picture, image2 contains only randomly generated pixels. As can be seen, the FPGA shows better performance in most cases and its performance does not depend on the image content, resulting in a deterministic behavior that is predictable if the image data size is known. However, the performance of the histogram implementation on the GPU depends on the image content, which makes prediction difficult even if the image size is known a priori. Note that in the two cases of image1(2048 × 2048) and image1(8192 × 8192) the GPU implementation is faster than that of the FPGA. Fig. 4 depicts the power and energy consumption of the histogram. Fig. 4(a) shows the power consumption on the two embedded systems for different image sizes. As can be seen, the embedded FPGA consumes much less power than the embedded GPU. Now, if we equally divide image1 of size 8192 × 8192 between the embedded FPGA and GPU, then the execution times on the FPGA and GPU would be about 3.51ms and 4.35ms, respectively, which improves the performance by a factor of 6.99/4.35 = 1.6. In this case, the FPGA and GPU energy consumptions are 4133.8µJ and 13653.9µJ, respectively, which improves the total energy consumption by a factor of 1.59. Fig. 5 shows the trade-off between the energy consumption and performance for running the histogram on the FPGA, the GPU, and both.
This trade-off has motivated us to study the performance and energy consumption of different applications on both platforms and to propose an FPGA+GPU based embedded system that improves the total performance and energy consumption by scheduling a given task between these two accelerators. The main contributions of this paper are as follows:
• Studying the challenges of design, modeling and scheduling on FPGA+GPU embedded systems.
• Clarifying the opportunities that addressing these challenges provides
• Proposing a stream computing technique on the FPGA to deal with the design challenge
• Modelling the FPGA performance and power consumption to cope with the modeling challenge
• Proposing an FPGA+GPU embedded system to improve performance and energy consumption to address the scheduling challenge
PREVIOUS WORK
There have been extensive studies on employing GPUs and FPGAs on desktop and cloud servers in the literature.
An OpenCL-based FPGA-GPU implementation of the database join operation is proposed in [22]. The authors use the Xilinx OpenCL SDK (i.e., SDAccel) to explore the design space. A real-time embedded heterogeneous GPU/FPGA system is proposed in [23] for radar signal processing. An energy-efficient sparse matrix multiplication is proposed in [9], which utilizes a GPU, a Xeon Phi, and an FPGA. An FPGA-GPU-CPU heterogeneous architecture has been considered in [17] to implement real-time cardiac physiological optical mapping. All these systems use PCIe to connect the GPU and FPGA to the host CPU. In contrast to these approaches, we assume a direct connection between the accelerators and the system memory.
A heterogeneous FPGA/GPU embedded system based on the Intel Arria 10 FPGA and the Nvidia Tegra X2 is presented in [5] to perform ultrasound imaging tasks. In contrast to this approach, we study the challenges and opportunities that hybrid FPGA/GPU embedded systems can bring to edge computing by considering a wider range of tasks and applications.
DESIGN CHALLENGE
This paper considers streaming applications, which can receive data, perform computation, and generate results in a pipelined fashion. Many tasks can be categorized as streaming applications, among them data-parallel, window, and block processing tasks [12].
There are many techniques and studies that show how to map a streaming application on GPUs [3, 6, 12, 13, 19, 24]; however, efficiently mapping these applications on FPGAs, using a systematic approach, requires more research.
Traditionally, FPGA accelerators are designed with Hardware Description Languages (HDLs), which can potentially provide a high-performance implementation. However, the HDL-based design flow is tedious and time-consuming. In addition, the design is not easily adaptable (modifiable) to the versatile edge computing environment, which includes a variety of algorithms with different configurations and complexity. To alleviate these issues, High-Level Synthesis (HLS) has been proposed by academia and industry and is increasingly popular for accelerating algorithms on FPGA-based embedded platforms. Studies have shown that HLS can provide high-performance and energy-efficient implementations while shortening time-to-market and addressing today's system complexity [18]. Following the HLS design flow, we propose a streaming pipelined computing engine to implement several applications. Fig. 6 shows the overview of the proposed stream computing engine. It consists of memory interfaces to communicate with the system memory and of computational pipelines. There can be multiple pipelined chains in the FPGA that receive/send their data from/to memory through the ports available on the system (such as the HP ports available on the Xilinx Zynq MPSoC). Each pipeline can consist of a few stages, including read, rearrange, computation, and write. The read stage fetches a stream of data from memory using multiple wide-bit ports.
The rearrange stage reorganizes the data by splitting and concatenating operators to prepare the read data for the successor stages. The computation stage performs the main job of the given task.
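As a concrete illustration, the sketch below shows how one such pipelined chain could be written in HLS C/C++. It is a minimal sketch rather than the paper's code: the stage names, the hls::stream channels, the 512-bit word width, and the placeholder kernel are our assumptions, chosen only to mirror the read/rearrange/compute/write structure of Fig. 6.

#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<512> bus_t;      // one wide memory word (e.g., an HP-port beat)
typedef ap_uint<32>  elem_t;     // element type assumed for illustration
static const int W = 512 / 32;   // elements per wide word

static void read_stage(const bus_t *mem, int n_words, hls::stream<bus_t> &out) {
    for (int i = 0; i < n_words; i++) {
#pragma HLS PIPELINE II=1
        out.write(mem[i]);                       // sequential (burst-friendly) reads
    }
}

static void rearrange_stage(hls::stream<bus_t> &in, int n_words, hls::stream<elem_t> &out) {
    bus_t word = 0;
    for (int i = 0; i < n_words * W; i++) {
#pragma HLS PIPELINE II=1
        int k = i % W;
        if (k == 0) word = in.read();            // fetch a new wide word every W elements
        out.write(word.range(32 * k + 31, 32 * k));
    }
}

static void compute_stage(hls::stream<elem_t> &in, int n_words, hls::stream<elem_t> &out) {
    for (int i = 0; i < n_words * W; i++) {
#pragma HLS PIPELINE II=1
        out.write(in.read() + 1);                // placeholder for the task's kernel
    }
}

static void write_stage(hls::stream<elem_t> &in, int n_words, elem_t *mem) {
    for (int i = 0; i < n_words * W; i++) {
#pragma HLS PIPELINE II=1
        mem[i] = in.read();
    }
}

void stream_engine(const bus_t *in_mem, elem_t *out_mem, int n_words) {
#pragma HLS INTERFACE m_axi port=in_mem  bundle=HP0
#pragma HLS INTERFACE m_axi port=out_mem bundle=HP1
#pragma HLS DATAFLOW
    hls::stream<bus_t>  raw("raw");
    hls::stream<elem_t> elems("elems"), results("results");
    read_stage(in_mem, n_words, raw);
    rearrange_stage(raw, n_words, elems);
    compute_stage(elems, n_words, results);
    write_stage(results, n_words, out_mem);
}

With the DATAFLOW directive, the four stages run concurrently and communicate through FIFOs, so the slowest stage bounds the throughput of the whole chain.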
A pipelined for loop is usually used to implement each stage, whose initiation interval (II) defines its throughput.
The II of a pipelined loop is the minimum number of clock cycles between the starting points of two consecutive loop iterations. If n and l denote the number of iterations in a loop and the latency of one iteration, respectively, then a pipelined loop requires (n·II + l) clock cycles to finish.
The stage with the maximum II restricts the total throughput and determines the execution time. If II_max and n_max denote the maximum II and the maximum number of iterations of the stages in a pipeline, respectively, then the total number of clock cycles required to finish a pipeline is determined by Equ. 1, where l_total is the total latency of one iteration of all stages in the pipeline.
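Equation 1 itself did not survive extraction; given the definitions above, its likely form is (our reconstruction):

C_pipeline = n_max · II_max + l_total    (1)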
MODELLING CHALLENGE
Performance and power modeling are key steps in mapping a task efficiently onto a heterogeneous system. Different types of modeling techniques have been proposed for GPUs in a system [1, 7, 10, 14, 15, 25, 27]. There are also a few articles proposing power and performance modeling for applications on FPGAs [16, 26, 30].
Most of these approaches are application-specific and consider the FPGA resource utilization, or they are simulation based. In contrast, we propose a high-level power and performance model suitable for an application implemented by HLS tools. This section addresses the power and performance modeling of streaming tasks running on an FPGA using the stream computing engine proposed in Fig. 6.
Performance
Traditionally, processing elements show their maximum performance if they can use their internal memory. For example, utilizing different levels of cache memories in CPUs is the main factor in improving the performance of several applications. GPUs utilize an L2 cache along with device memories to improve the performance and provide parallel data access for the many streaming processors in their architecture. FPGAs also benefit from their internal BRAMs and distributed registers to save data temporarily during the computation. The FPGA internal memories can be used as a cache memory tailored to the task implemented on the FPGA. There have been many research activities on modifying the device and cache memory architectures to improve the performance on GPUs and CPUs, such that repetitive applications with the data-reuse feature can transfer the data once to the device or cache memories and benefit from their low latency and high speed. However, applications that require fresh data in each iteration, such as database processing, suffer from the high latency of accessing the system memory. Using the zero-copy methodology and pipelining the data transfer with the data computation are techniques to alleviate the relatively high latency of the system memory. The zero-copy technique maps the system memory as the device memory to be accessed directly by the processing elements. The Nvidia Jetson TX1 can utilize zero-copy using the unified memory programming technique, first introduced in CUDA 6.0. The proposed streaming engine in Fig. 6 also benefits from the zero-copy technique to read data from the system memory, which is pipelined with the computation.

However, some parts of a task may not be able to benefit from this technique. For example, in the dense matrix-vector multiplication described by Equ. 2 (i.e., y = Ax), the vector x should be located in the FPGA internal memory (e.g., BRAM) to be reused for calculating each element of the output vector (i.e., y). In this case, a stream computing engine with only one stage (which is a pipelined for loop) can transfer the x vector to the BRAM; then a stream computing engine with three stages can read the elements of matrix A to generate the output. Fig. 7 shows this two-step stream processing. The first step is a pipelined loop with an iteration count of m, where m is the size of vector x. The second step can be implemented by pipelined for loops with an iteration count of n × m, where n is the size of the output vector. Note that both steps share the same memory interface; however, they are executed one after the other.
The number of clock cycles for finishing each step in Fig. 7 can be described by Equ. 1. The II of the first step is one, as it uses burst data transfer, and its loop iteration count is m; therefore, it takes (m + l_1) clock cycles to finish, where l_1 is the latency of one iteration.
The initiation interval of the second step can also be one (the optimized implementation is presented in Section 7) and its loop iteration count is n × m. Therefore, it takes (n × m + l_2) clock cycles to finish, where l_2 is the latency of one iteration of all loops involved in the pipeline. Equ. 3 represents the total number of clock cycles required to finish the whole task. If the size of the input matrix is large enough to ignore the m, l_1, and l_2 terms, then Equ. 4 represents the performance of the task running on the FPGA, which is directly defined by the data size (i.e., the input matrix). Fig. 8(a) shows the execution time versus data size for the dense matrix-vector multiplication.
Equation 3 can be generalized to Equ. 5 to model the performance of a task with S stages.
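Equations 3-5 likewise did not survive extraction. Based on the surrounding definitions, plausible forms are (our reconstruction, with f_clk denoting the FPGA clock frequency):

C_DeMV = (m + l_1) + (n·m + l_2)    (3)
t_FPGA ≈ (n·m) / f_clk    (4)
C_total = Σ_{s=1..S} (n_s · II_s + l_s)    (5)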
Power and Energy
The power consumption of a task running on an accelerator usually consists of two main parts: the accelerator and the memory power consumption. The accelerator power consumption is defined by the number of switching activities that happen in the underlying semiconductor fabric, caused by value changes on the input data. In this section, we propose a simple model for the average FPGA and memory power of the stream computing engine proposed in the previous section. For the sake of simplicity, let us take the dense matrix-vector multiplication shown in Fig. 7. If we assume that p_1 and p_2 represent the average power of the first and second stages, respectively, then Equ. 6 or Equ. 7 shows the total average power. Note that in this formula we have ignored the iteration latencies (i.e., l_1 and l_2) of Equ. 3, for the sake of simplicity.
For large data sizes, the second term in Equ. 7 mainly defines the power, while for small data sizes both terms are comparable and together determine the total power. Fig. 8(b) shows the power consumption versus data size for the dense matrix-vector multiplication.
This formula can be generalized for tasks with more stages as Equ. 8, where S is the number of stages and p_s and n_s represent the power and data size of each stage.
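Equations 6-8 are also missing from the extracted text. A time-weighted average of the per-stage powers is consistent with the description above; plausible forms are (our reconstruction):

P_avg = (p_1·m + p_2·(n·m)) / (m + n·m)    (6)
P_avg = (1/(1+n))·p_1 + (n/(1+n))·p_2    (7)
P_avg = ( Σ_{s=1..S} p_s·n_s ) / ( Σ_{s=1..S} n_s )    (8)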
SCHEDULING CHALLENGE
Task scheduling among multiple processors in a system is a mature subject with extensive research activities. However, existing techniques need a degree of modification and tuning to be applied to a new system such as the heterogeneous FPGA+GPU embedded system considered in this paper. For the sake of simplicity, we only consider the scheduling problem for data-parallel tasks. In this case, we should divide the data between the FPGA and GPU to achieve high performance. For this purpose, both the FPGA and GPU should deliver their maximum performance and should finish their share of the task at the same time. In other words, load balancing is required for maximum performance.
Here we only propose a simple task division between the FPGA and GPU for large data sizes, so that the behavior of the system is more predictable and depends mainly on the data sizes. Under this assumption, the FPGA and GPU execution times are directly proportional to the data size, as shown in Equs. 9 and 10, where n_fpga and n_gpu are the data sizes assigned to the FPGA and GPU, respectively, and a and b are constants that can be determined by the data patterns. In this case, the task division and load balancing can be described by Equs. 11 and 12, respectively. Solving these equations results in Equ. 13. If α represents the GPU speed-up compared to the FPGA (i.e., α = a/b), then Equ. 14 shows the task scheduling solution. Section 7 empirically evaluates this task scheduling solution.
n_fpga = n / (1 + α)   and   n_gpu = α·n / (1 + α)    (14)
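As a usage illustration of Equ. 14, the small helper below computes the split for a given total data size. It is a hypothetical sketch: the function name is ours, and the example speed-up value α = 0.85 is the value reported later in the paper for the histogram on large data sizes.

#include <cstdio>

// Split n data items between FPGA and GPU according to Equ. 14,
// where alpha is the measured GPU speed-up over the FPGA (alpha = a/b).
static void split_workload(long n, double alpha, long &n_fpga, long &n_gpu) {
    n_fpga = static_cast<long>(n / (1.0 + alpha));  // FPGA share: n / (1 + alpha)
    n_gpu  = n - n_fpga;                            // GPU share: alpha * n / (1 + alpha)
}

int main() {
    long n_fpga, n_gpu;
    // Example: histogram on 8388608 bytes, Jetson TX1 vs. Zynq MPSoC (alpha = 0.85).
    split_workload(8388608, 0.85, n_fpga, n_gpu);
    std::printf("FPGA: %ld bytes, GPU: %ld bytes\n", n_fpga, n_gpu);
    return 0;
}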
EXPERIMENTAL SETUPS
Although the FPGA is connected to the Jetson TX1 over a PCIe bus, this setup can still be used to study some of the features and behaviors of heterogeneous embedded systems if we assume that the input data is available in the FPGA onboard memory, to which the FPGA has direct access over a 512-bit wide AXI bus. Fig. 9 illustrates the system hardware architecture, through which the FPGA is connected to the Jetson TX1 board over a 4x PCIe bus.
The FPGA hardware comprises two sections. The first section, consisting of the Xillybus IP [29], a data transfer unit (DTU), and the DDR3 interface, provides the data path between the PCIe and the onboard DDR memory. The Xillybus IP provides a streaming data transfer over PCIe; the DTU receives this stream and copies it into the DDR3 memory using a master AXI bus through the DDR3 interface. Fig. 7 shows the high-level C code for the write-to-memory part of the data transfer unit (DTU), synthesizable with Xilinx Vivado HLS. It consists of a pipelined loop that receives a unit of data and writes it to the memory in each clock cycle.
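Since the referenced figure is not reproduced in this text, the following is a minimal sketch of what such a write-to-memory loop typically looks like in Vivado HLS C; the stream type, port names, and word width are our assumptions rather than the paper's code.

#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<32> word_t;  // assumed width of one Xillybus data word

// Write-to-memory part of the DTU (hypothetical sketch): drain the incoming
// PCIe stream and store one word per clock cycle into the onboard DDR3.
void dtu_write(hls::stream<word_t> &pcie_in, word_t *ddr, int n_words) {
#pragma HLS INTERFACE m_axi port=ddr bundle=DDR3
#pragma HLS INTERFACE axis  port=pcie_in
write_loop:
    for (int i = 0; i < n_words; i++) {
#pragma HLS PIPELINE II=1
        ddr[i] = pcie_in.read();
    }
}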
The maximum memory bandwidth provided by the first path is 800MBytes/s, mainly because of the PCIe Gen1 used in the Jetson TX1, which is compatible with the Xilinx IP core located in the Virtex-7 FPGA.
The second path consists of the user design and the DDR3 interface, which can provide up to 6.4GBytes/s using a 512-bit wide bus at a frequency of 100MHz.
A Xilinx MicroBlaze software core is used to control the activation of the different paths in the FPGA. For this purpose, it runs a firmware that receives different commands from the host processor on the Jetson TX1 through PCIe and activates the application design. This controller also informs the system when the task execution finishes. Onboard memory management and allocation are other tasks of the controller. In summary, the firmware running on the MicroBlaze performs the following functions (a hypothetical host-side usage sequence is sketched after the list):
• initFpga: This function performs the FPGA initialization and prepares the memory allocation tables on the MicroBlaze.
• fpgaMalloc: This function gets two arguments: the ID variable and the size of the memory requested for allocation. It returns the start address of the allocated memory, or -1 if the memory allocation fails.
• startAccel: Receiving this command, the MicroBlaze activates the design to perform its task.
• fpgaFree: This function resets the memory allocation table entries corresponding to the allocated memories.
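To make the command flow concrete, the following is a hypothetical host-side sequence. The wrapper functions and their signatures are assumptions that stand in for whatever host API issues these commands over the Xillybus PCIe devices; they are not defined in the paper.

// Hypothetical host-side wrappers for the MicroBlaze firmware commands.
// These stubs only illustrate the calling sequence; the real transport is
// the Xillybus PCIe character devices.
static int  initFpga()                      { return 0; }
static int  fpgaMalloc(int id, unsigned sz) { (void)id; (void)sz; return 0; }
static int  startAccel()                    { return 0; }
static void fpgaFree()                      {}

int run_on_fpga(const unsigned char *data, unsigned size) {
    if (initFpga() != 0) return -1;            // prepare allocation tables
    int addr = fpgaMalloc(/*id=*/0, size);     // onboard buffer for the input
    if (addr < 0) return -1;
    // ... stream `data` to the onboard DDR3 through Xillybus at `addr` ...
    startAccel();                              // MicroBlaze activates the design
    // ... wait for the completion notification and read back the results ...
    fpgaFree();                                // release the allocation table entries
    return 0;
}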
The algorithm under acceleration is described in HLS C/C++ that is synthesizable by Xilinx Vivado HLS; it uses the AXI master protocol to send/receive data to/from the DDR3 memory using burst data transfers.
EXPERIMENTAL RESULTS
Three different tasks are studied as benchmarks in this section to evaluate the potential of an embedded FPGA+GPU system in providing high performance and low energy consumption. The results show that concurrent execution on the FPGA and GPU can result in up to 2x performance or energy reduction after efficient algorithm implementation, correct workload balancing, and data transfer optimizations. The three algorithms are: histogram, dense matrix-vector multiplication (DeMV), and sparse matrix-vector multiplication (SpMV). The experimental setup explained in Section 7 is used for real measurements. In addition, for the sake of completeness, the two distinct Jetson TX1 and Zynq MPSoC systems are also used to generate results for comparison, even though they are not connected to each other.

Histogram

Fig. 11(a) shows the original histogram pseudo-code. It consists of a for loop iterating over the entire input data, modifying the hist array (the histogram bin holder) using the input data as the index to access the corresponding bin. This naïve algorithm can easily be pipelined on the FPGA using the Xilinx Vivado HLS tool; however, because of the data dependency between two consecutive loop iterations (note that two consecutive iterations can modify the same bin in the hist array), the obtained initiation interval is 2, which reduces the performance. Fig. 11(b) shows one hardware thread of the stream computing implementation of the histogram suitable for the FPGA. It consists of two stages. The first stage, from Line 1 to Line 3, reads data from the memory using the burst protocol, i.e., reading one data item per clock cycle (II=1). The second stage modifies the bins. As the initiation interval of the pipelined loop for the hist modification is 2, this loop reads two data items per iteration and modifies the hist array, resolving the potential conflict using the if condition at Line 9. As this stage reads two data values in each iteration and its II is 2, the average number of data items read per clock cycle is 2/2 = 1; that is, it consumes data at the same pace as it is generated by the first stage. As the total memory bus width in the Zynq MPSoC and Virtex 7 is 512 bits and each pixel in the input image is represented by an 8-bit code, 512/8 = 64 hardware threads can be instantiated to implement the histogram on the FPGA. Table 2 shows the resource utilization of the 64-thread implementations of the histogram on the Zynq MPSoC and Virtex 7 FPGAs.

The power consumption of the histogram task versus the data size on the three platforms is shown in Fig. 12. As mentioned in Subsection 5.2, the power consumption consists of two components: the accelerator (i.e., GPU or FPGA) and the memory. As can be seen from these diagrams, running the histogram on the Zynq MPSoC consumes the least power among the three platforms. As the Jetson TX1 and Zynq MPSoC both utilize embedded memories, their memory power consumption is less than the Virtex 7 memory power requirement. The GPU consumes about 7.7 and 4.8 times more power than the Zynq MPSoC and Virtex 7, respectively. Fig. 13 compares the histogram execution time and energy consumption versus the data size on the three platforms. As can be seen, although the performance of this task is very close on the Jetson TX1 and Zynq MPSoC, its energy consumption on the Zynq MPSoC is about 10 times less than that of the Jetson TX1.
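Since Fig. 11(b) is not reproduced here, the sketch below illustrates one such hardware thread in HLS C/C++. It is our reconstruction of the two-stage idea (burst read, then conflict-aware bin update); the names, the 8-bit pixel type, and the exact conflict handling are assumptions rather than the paper's code.

#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<8> pixel_t;   // assumed 8-bit pixels (256 bins)

// One hardware thread of the streaming histogram (hypothetical sketch).
void hist_thread(const pixel_t *mem, int n, int hist[256]) {
#pragma HLS DATAFLOW
    hls::stream<pixel_t> pix("pix");

    // Stage 1: burst-read one pixel per clock cycle (II = 1).
read_stage:
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        pix.write(mem[i]);
    }

    // Stage 2: II = 2, but two pixels are consumed per iteration, so the
    // average throughput matches stage 1 (one pixel per cycle).
update_stage:
    for (int i = 0; i < n / 2; i++) {
#pragma HLS PIPELINE II=2
        pixel_t a = pix.read();
        pixel_t b = pix.read();
        if (a == b) {
            hist[a] += 2;         // both pixels fall into the same bin
        } else {
            hist[a] += 1;
            hist[b] += 1;
        }
    }
}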
According to the performance diagrams of Fig. 13, the speed-up factors (i.e., α in Equ. 14) of the Jetson relative to the Zynq MPSoC and Virtex 7 FPGAs are 0.85 and 2.0, respectively, for large data sizes. Table 3 shows the results of using Equ. 14 to divide an input of 8388608 bytes between the GPU and FPGA. The table shows 1.79 and 2.29 times improvement in performance and energy consumption, respectively, if the task is divided between the Zynq and the Jetson, compared to only the GPU running the application. In addition, it shows 1.18 and 1.45 times improvement in performance and reduction in energy consumption, respectively, if the task is divided between the Virtex 7 and the Jetson, compared to only the GPU running the application.

Dense Matrix-Vector Multiplication (DeMV)

Fig. 14(a) shows the naïve pseudo-code for the dense matrix-vector multiplication, which consists of two nested loops performing the accumulation statement at Line 4. Fig. 14(b) shows one thread of the pipelined version of this task, which consists of two stages. The first stage, from Line 1 to Line 4, reads the data on each clock cycle. The pipelined loop in the second stage, from Line 6 to Line 12, shows an II of 4 after synthesis, which reduces the total performance. In order to address this issue, we have unrolled this loop by a factor of 4 to read four data values in each iteration. Therefore, it consumes data at the same pace as it is generated by the first stage. This results in an II of 1 for the whole design. Table 4 shows the FPGA resource utilization.

Fig. 15 shows the power consumption diagrams of running DeMV on the three embedded platforms. The GPU consumes up to 5.20 and 4.3 times more power than the Zynq MPSoC and Virtex 7 FPGAs, respectively. Fig. 16 compares the DeMV performance and energy consumption. Similar to the histogram task, the Zynq shows much less energy consumption compared to the other PEs.

According to the performance diagrams of Fig. 16, the speed-up factors (i.e., α in Equ. 14) of the Jetson relative to the Zynq MPSoC and Virtex 7 FPGAs are 0.51 and 0.23, respectively, for large data sizes. Table 5 shows the results of using Equ. 14 to divide an input data size of 33554432 between the GPU and FPGA. The table shows 1.48 and 1.19 times improvement in performance and energy consumption, respectively, if the task is divided between the Zynq and the Jetson, compared to only the GPU running the application. In addition, it shows a 1.22× improvement in performance and a slight increase (i.e., 1 − 0.96 = 0.04×) in energy consumption if the task is divided between the Virtex 7 and the Jetson, compared to only the GPU running the application.
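As with the histogram, Fig. 14(b) is not reproduced here; the following is a hedged HLS C/C++ sketch of one DeMV thread that mirrors the described structure (buffer x in BRAM, stream the matrix, unroll the accumulation by 4). The names and sizes are assumptions; integer data is assumed so the accumulation can close at II=1, and the four reads per iteration stand in for one wide-bus access.

#include <ap_int.h>

#define M 1024   // assumed vector length (number of columns of A)

// One thread of the streaming DeMV (hypothetical sketch): y = A * x.
void demv_thread(const int *x_mem, const int *a_mem, int *y_mem, int n) {
    int x[M];
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=4

    // Step 1: burst-copy the vector x into BRAM so it can be reused for
    // every row of A (pipelined, II = 1, M iterations).
load_x:
    for (int j = 0; j < M; j++) {
#pragma HLS PIPELINE II=1
        x[j] = x_mem[j];
    }

    // Step 2: stream matrix A row by row; the accumulation loop is unrolled
    // by a factor of 4 so four elements are consumed per iteration.
rows:
    for (int i = 0; i < n; i++) {
        int acc = 0;
cols:
        for (int j = 0; j < M; j += 4) {
#pragma HLS PIPELINE II=1
            acc += a_mem[i * M + j + 0] * x[j + 0]
                 + a_mem[i * M + j + 1] * x[j + 1]
                 + a_mem[i * M + j + 2] * x[j + 2]
                 + a_mem[i * M + j + 3] * x[j + 3];
        }
        y_mem[i] = acc;
    }
}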

Sparse Matrix-Vector Multiplication (SpMV)
The pseudo-code of the sparse matrix-vector multiplication, based on the Compressed Sparse Row (CSR) representation [4], is shown in Fig. 17(a). One thread of the corresponding streaming computation suitable for the FPGA is shown in Fig. 17(b). Table 6 contains the FPGA resource utilization of this task after synthesis.
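Fig. 17 is likewise not reproduced here; the following is a standard CSR-based SpMV kernel in C++, given only as a reference for the data layout the paper builds on. The streaming FPGA thread of Fig. 17(b) would restructure this into pipelined stages, which we do not attempt to reproduce.

// Sparse matrix-vector multiplication y = A * x with A in CSR form [4]:
//   row_ptr[i] .. row_ptr[i+1]-1 index the non-zeros of row i,
//   col_idx[k] is the column of the k-th non-zero, val[k] its value.
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y) {
    for (int i = 0; i < n_rows; i++) {
        float acc = 0.0f;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += val[k] * x[col_idx[k]];   // gather from x by column index
        y[i] = acc;
    }
}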
The SpMV power consumption versus data size for the three platforms is shown in Fig. 18. As can be seen, the Zynq MPSoC consumes the least power among the three platforms. Fig. 19 compares the performance and energy consumption of SpMV on the Jetson TX1, Zynq MPSoC, and Virtex 7.
According to the performance diagrams of Fig. 19, the speed-up factors (i.e., α in Equ. 14) of the Jetson relative to the Zynq MPSoC and Virtex 7 FPGAs are 3.2 and 6.4, respectively, for large data sizes. Table 7 shows the results of using Equ. 14 to divide an input data size of 2943887 between the GPU and FPGA.
The table shows 1.46 and 1.23 times improvement in performance and energy consumption, respectively, if the task is divided between the Zynq and the Jetson, compared to only the GPU running the application. In addition, it shows a 1.15× improvement in performance and a 1.1× reduction in energy consumption if the task is divided between the Virtex 7 and the Jetson, compared to only the GPU running the application.
CONCLUSIONS
This paper has studied the challenges and opportunities that designers face when using a heterogeneous embedded FPGA+GPU platform. The challenges are categorized into three groups: design, modeling, and scheduling. Using the image histogram operation, the paper has clarified the trade-off between performance and energy consumption when distributing the task between the GPU and FPGA. Focusing on the FPGA, the paper has then proposed a stream computing engine with a corresponding modeling technique to cope with the design and modeling challenges, respectively. A scheduling technique has been proposed to improve the performance and energy consumption by distributing a parallel task between the FPGA and GPU. Three applications, namely histogram, dense matrix-vector multiplication, and sparse matrix-vector multiplication, have been used to evaluate the proposed techniques. The experimental results have shown improvements in performance and reductions in energy consumption by factors of 1.79× and 2.29×, respectively.
