Abstract: Hardware acceleration is often used to address the need for speed and computing power in embedded systems. FPGAs have always represented a good solution for HW acceleration and, recently, new SoC platforms have extended the flexibility of FPGAs by combining high-performance CPUs and FPGA fabric on a single chip.
I. INTRODUCTION
Nowadays the embedded systems market constantly demands more powerful and faster machines, capable of processing huge amounts of data in the shortest possible time. This need for computing power stems both from the increasing complexity of signal-processing algorithms and from the growing amount of data to process.
A hardware acceleration approach is often used to address this issue. It exploits the high parallelism achievable with dedicated hardware structures and offloads from the CPU the computational burden of executing these operations serially on an ALU. In this field, FPGAs have always represented a good trade-off between flexibility, cost, power consumption and time-to-market. In recent years, newer products have extended the concept of a flexible platform by combining high-performance ARM CPUs (Processing System) and FPGA fabric (Programmable Logic) on a single chip, as in the case of the Zynq™-7000 SoC [6].
The benefits that this kind of platform offers for acceleration purposes are remarkable, which is why this work aims to implement hardware accelerators for these new SoCs. The innovative feature of the accelerators to be developed is the on-the-fly reconfiguration of the hardware. The design methodology will take advantage of the latest FPGA partial reconfiguration techniques to dynamically adapt the accelerator's functionality to the current CPU workload, allowing full exploitation of the SoC in terms of performance and flexibility.
Realizing the accelerators requires a preliminary characterization of the performance of the Zynq™ Processing System (i.e. the ARM CPU in conjunction with the NEON Media Processing Engines) and a comparison with a HW acceleration approach. We also need to evaluate the partial reconfiguration times and to develop an application-specific IP-core library that allows the FPGA to adapt itself to the CPU workload. We named this library the Hardware Link Library (HLL) because the principle is similar to that of a Dynamic Link Library, but it targets actual hardware modules of the design instead of software modules.
In this paper, we will focus only on the profiling aspect, leaving the partial reconfiguration topic and the development of the IP-cores library for future work.
The paper is structured as follows: Section 2 briefly illustrates the target platform for our implementations, while Section 3 describes the implementation for the Processing System (PS) and the case study on the NEON units. The results of the NEON units acceleration are presented in Section 4. Section 5 deals with the proposed HW implementations, and the comparison between HW and SW results (in terms of speed and power consumption) is presented in Sections 6 and 7. Finally, we draw the conclusions in Section 8.
II. IMPLEMENTATION PLATFORM
The hardware accelerators are implemented on the Z-7020 device of the Zynq™-7000 SoC family. The architecture of the Zynq™ comprises two parts: the Processing System (PS) and the Programmable Logic (PL). These two sections are independent (they also have distinct power domains) and can be used separately or in conjunction. The interconnection between the PS, the PL and the software-configurable I/O peripherals is provided by the AMBA AXI bus.
A. Processing System
The Processing System (PS) of the device features a dual-core ARM Cortex-A9 with a 32-bit RISC architecture [2]. Along with the processor, a set of processing resources is available within the Application Processing Unit (APU) of the Zynq. The most important resources for this work are the two NEON Media Processing Engines for SIMD operations [1], the 32KB L1 and 512KB L2 caches and the 256KB On-Chip Memory (OCM) RAM. A multi-protocol DDR memory controller is provided to support external DRAM memories.
It is worth mentioning that the Programmable Logic configuration is managed by the PS.
B. Programmable Logic
The other main portion of the Zynq-7000 SoC is the Programmable Logic (PL). This general purpose 28nm FPGA fabric is based on the Xilinx Artix-7 technology. The key features, other than the Configurable Logic Blocks (CLBs), are the dual-ported 36Kb BlockRAMs (dedicated memory resources) and the DSP48E Slices (dedicated silicon resources for DSP and high-speed arithmetic).
C. Zedboard and Board Power Measurements
In order to work with the Zynq™ SoC we used the Zedboard development board. This board features an XC7Z020 Zynq device, 512MB (2 × 128Mb × 16) DDR3 memory and a comprehensive set of peripherals.
Among other things, the Zedboard also features a pair of current-sense pin headers that are used to measure the power consumption of the board. These headers straddle a 10 mΩ, 1%, 1 W current-sense resistor placed in series with the 12 V power supply. By Ohm's law, the current drawn by the board is the voltage drop divided by the sense resistance, so the power can be calculated using the following formula:
$$P = \frac{V_m}{10\,\mathrm{m\Omega}} \times 12\,\mathrm{V}$$
where $V_m$ is the voltage drop (in millivolts) across the resistor.
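As an illustration of the power measurement described above, the following is a minimal C sketch. The helper name `board_power_w` and the macro names are hypothetical; only the 10 mΩ resistance and the 12 V supply come from the text.

```c
#include <assert.h>

/* Sense resistor and supply rail as stated in the text. */
#define SENSE_RESISTOR_MOHM 10.0   /* 10 mOhm */
#define SUPPLY_VOLTAGE_V    12.0   /* 12 V */

/* Hypothetical helper: board power in watts from the sense voltage drop in mV. */
double board_power_w(double v_m_mv)
{
    assert(v_m_mv >= 0.0);
    double current_a = v_m_mv / SENSE_RESISTOR_MOHM; /* mV / mOhm = A */
    return current_a * SUPPLY_VOLTAGE_V;             /* A * V = W */
}
```

For example, a hypothetical reading of 5 mV across the resistor would correspond to a current of 0.5 A and a board power of 6 W.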
III. PROPOSED SW IMPLEMENTATIONS
As mentioned in the previous sections, the PS of the Zynq™ comprises dedicated architectures for Single Instruction Multiple Data (SIMD) operations. These architectures are named NEON Media Processing Engines (MPE) or NEON Units and can offer a certain amount of parallelism, with some benefits over the standard CPU approach.
Before implementing hardware accelerators we believe that it is of paramount importance to have a comprehensive knowledge of the capabilities and limitations of the ARM CPUs in conjunction with the NEON units, especially when targeting the smaller devices of the Zynq-7000 family, which have a limited amount of PL resources.
A. NEON Units SW acceleration
The basic concept behind NEON's SIMD technique is that the data to be processed is packed into special wide registers that can hold multiple smaller words. In this way, by specifying a single operation over these registers, multiple data values are processed in parallel using just a single instruction, with benefits over the standard Single Instruction Single Data (SISD) approach.
The potential of this methodology is fully exploited when simple and repeated operations have to be performed on large data sets made of elements that have small word-lengths (up to 32 bits).
NEON units can handle both single-precision floating-point and signed/unsigned integer data types but not double-precision floating-point.
There are four ways to boost the SW performance with NEON Units:
• Using NEON optimized libraries.
• Using automatic vectorization from the compiler.
• Using NEON intrinsics.
• Optimizing NEON assembler code manually.
In this work we decided to test the NEON units' performance using both automatic vectorization and the intrinsics methodology.
B. BLAS and NEON intrinsics
The tests were carried out targeting some of the operations specified in the Basic Linear Algebra Subprograms, or BLAS, routines [7] , [4] . We decided to use BLAS routines as these are low-level routines that represent a standard for basic vector, matrix and linear algebra operations. Moreover, BLAS implementations are often optimized for speed on a particular machine and can take advantage of dedicated floating-point hardware such as vector registers and SIMD architectures, as in the case of NEON units.
Some routines are often used to measure performance. For example, the LINPACK Benchmark, which is a common measure of a system's floating-point performance, relies heavily on GEMM, a Level 3 BLAS routine.
For these reasons, in this work we translated into C one function from Level 1 (vector-vector operations) and one from Level 3 (matrix-matrix operations) of the original Fortran source code (we will refer to these versions as C-BLAS), and later we optimized the code for the NEON units using NEON intrinsics.
C. Implemented routines
The selected routines for our implementation were SDOT and SGEMM, where the prefix "S-" indicates that the operations will be performed on single-precision floating-point elements.
Although the BLAS routines support multiple data formats, we decided to use floating-point numbers because benchmarks usually refer to floating-point operations.
Since we are using 32-bit words for floating-point elements of the vector, the maximum parallelism achievable with the NEON units is 4 if we use the NEON registers in the 128-bit (Q) configuration.
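To make the four-lane parallelism concrete, the following is a minimal sketch of a single-precision dot product written with NEON intrinsics. It is not the authors' code: the function name `sdot_4lane` is hypothetical, unit stride is assumed, and a portable fallback that mimics the four lane accumulators is provided for compilers without NEON support.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Dot product of two float vectors, 4 lanes at a time (unit stride assumed). */
float sdot_4lane(size_t n, const float *x, const float *y)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
    float sum;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);       /* four partial sums in one Q register */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);     /* load 4 floats from x */
        float32x4_t vy = vld1q_f32(y + i);     /* load 4 floats from y */
        acc = vmlaq_f32(acc, vx, vy);          /* acc += vx * vy, lane by lane */
    }
    /* Reduce the four lanes to one scalar. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    sum = vget_lane_f32(half, 0);
#else
    /* Portable fallback that mimics the four independent lane accumulators. */
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (; i + 4 <= n; i += 4)
        for (int l = 0; l < 4; ++l)
            acc[l] += x[i + l] * y[i + l];
    sum = (acc[0] + acc[2]) + (acc[1] + acc[3]);
#endif
    for (; i < n; ++i)                         /* scalar tail when n is not a multiple of 4 */
        sum += x[i] * y[i];
    return sum;
}
```

Because the four lanes accumulate independently, the rounding of the result may differ slightly from that of a strictly sequential loop for general floating-point inputs.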
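The target routines perform the following operations:
1) SDOT: produces the dot (scalar) product of two vectors:
$$\mathrm{dot} = \sum_{i=1}^{n} x_i\, y_i$$
2) SGEMM: performs the multiplication between matrices:
$$C \leftarrow \alpha\, \mathrm{op}(A)\, \mathrm{op}(B) + \beta\, C$$
where $\mathrm{op}(X)$ is either $X$ or its transpose $X^T$, and $\alpha$ and $\beta$ are scalars.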
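The two routines can be sketched in plain C as follows. This is a simplified illustration, not the C-BLAS translation used in the paper: the names `sdot_ref` and `sgemm_ref` are hypothetical, unit stride and no transposition are assumed, and matrices are stored row-major (the Fortran reference code is column-major).

```c
#include <assert.h>
#include <stddef.h>

/* SDOT restricted to unit stride: returns the sum of x[i] * y[i]. */
float sdot_ref(size_t n, const float *x, const float *y)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* SGEMM restricted to non-transposed operands and row-major storage:
 * C (m x n) = alpha * A (m x k) * B (k x n) + beta * C.
 * C must be initialized by the caller, since it is read even when beta is 0. */
void sgemm_ref(size_t m, size_t n, size_t k, float alpha,
               const float *a, const float *b, float beta, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];   /* row i of A times column j of B */
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}
```

Each element of the result is itself a dot product, which is why the same lane-parallel technique applies to both routines.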
IV. NEON UNITS ACCELERATION RESULTS
Each routine was tested in both the C-BLAS and the NEON intrinsic versions using five different optimization options [3]:
1) C-BLAS "as-is", without any optimization from the compiler (the Optimization Level option in the GCC compiler settings was set to -O0).
2) C-BLAS optimized by the compiler using the automatic vectorization option (Optimization Level set to -O1) [10].
3) Same as point 2 but with an Optimization Level of -O2.
4) Same as point 2 but with an Optimization Level of -O3.
5) NEON intrinsic version of the code with an Optimization Level of -O3. This configuration is the fastest and represents the most optimized solution tested.
The timer available in the PS is used to measure the execution time of each subroutine. This timer runs at half the CPU clock frequency and has a clock period of 3 ns. Therefore, the execution time was calculated as follows:
$$t_{exec} = N_{ticks} \times 3\,\mathrm{ns}$$
where $N_{ticks}$ is the number of timer ticks counted during the execution of the subroutine.
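The conversion from timer ticks to execution time described above can be sketched in C as follows. The helper name `elapsed_ns` is hypothetical, and the two timer readings are assumed to be available as 64-bit counts; how the timer is actually read on the PS is not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define TIMER_PERIOD_NS 3u  /* timer clock period stated in the text */

/* Hypothetical helper: elapsed time in ns between two timer readings. */
uint64_t elapsed_ns(uint64_t start_ticks, uint64_t end_ticks)
{
    assert(end_ticks >= start_ticks);   /* no wrap-around handling in this sketch */
    return (end_ticks - start_ticks) * TIMER_PERIOD_NS;
}
```

For instance, a difference of 1000 ticks between the two readings corresponds to 3000 ns.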
The tests are performed with the L1 and L2 caches both enabled and disabled in order to evaluate the impact of caching on execution time.
A. SDOT results
Fig. 1 shows the curves obtained for the SDOT function. In this figure the -O2 and -O3 curves overlap as the execution times are almost the same.
It is worth noting that there is a considerable speed-up between the non-optimized custom version and the version optimized with automatic vectorization. The gap is even bigger for the version optimized with NEON intrinsics. By contrast, there is little difference in execution time between the various Optimization Levels. The difference between these levels lies instead in the code size, which increases with the optimization level.
The following table shows the average speed-up factors for the various optimizations compared to the non-optimized solution with the caches enabled:
[Table: Optimization Level vs. Speed-up factor; first row: -O1 (C-BLAS)]
The speed-up factor with the automatic vectorization from the compiler is almost four, reflecting the parallelism of the NEON operations. The speed-up obtained with the NEON intrinsic solution almost doubles that value, demonstrating that an additional improvement can be achieved using code tailored to the NEON architecture.
Moreover, we can notice a knee in the curves, in particular in the NEON intrinsic one, at an array length of 8000. One possible explanation is that, since the 32KB L1 and 512KB L2 caches are non-inclusive, at this point the L1 hit rate decreases abruptly and the CPU starts retrieving data from the slower L2 cache. In the cache-disabled case the knee is not visible but the execution times are much higher. Proper caching may, therefore, be a way to avoid this performance drop.
B. SGEMM results
Fig. 2 shows the results obtained for the SGEMM routine. In this figure, the three Optimization Levels -O1, -O2 and -O3 are not distinguishable as we got almost the same timing results in each case.
It can be noted that there is a discontinuity in the trend of the curves for matrix dimensions of 350 × 350. This is reflected to some extent also in the speed-up factors, as can be seen in the next table:
TABLE II SGEMM OPTIMIZATION LEVELS VS. SPEED-UP FACTORS
This is a cache inefficiency issue, since the amount of memory needed to store a matrix of single-precision floating-point elements with dimensions of 375 × 375 is:
$$375 \times 375 \times 4\,\mathrm{B} = 562{,}500\,\mathrm{B} \approx 549\,\mathrm{KB}$$
Since the L1 (32 KB) and L2 (512 KB) caches in the Processing System of the Zynq™ are non-inclusive, they can store in total up to 544 KB of data. Matrices with dimensions greater than 350 × 350 exceed that capacity, meaning that the data-cache hit rate in those cases may decrease abruptly if memory access is not handled properly, making the whole process more DDR dependent and, hence, slower.
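The capacity argument above can be checked with a few lines of C. This is only a sketch: the helper names are hypothetical, a 4-byte `float` is assumed, and the check considers a single matrix in isolation, ignoring the other operands and any code or stack data that share the caches.

```c
#include <assert.h>
#include <stddef.h>

#define L1_BYTES (32u * 1024u)    /* 32 KB L1, from the text */
#define L2_BYTES (512u * 1024u)   /* 512 KB L2, from the text */

/* Bytes needed to store one n x n matrix of 32-bit floats. */
size_t matrix_bytes(size_t n)
{
    assert(n <= 32767u);          /* keeps n * n * 4 within a 32-bit size_t */
    return n * n * sizeof(float);
}

/* Returns 1 if a single n x n float matrix fits in L1 + L2 combined, else 0. */
int fits_in_caches(size_t n)
{
    return matrix_bytes(n) <= (size_t)(L1_BYTES + L2_BYTES);
}
```

With these definitions a 350 × 350 matrix (490,000 bytes) fits within the combined 557,056 bytes, while a 375 × 375 matrix (562,500 bytes) does not.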
This may also explain why the trends are more regular before 350 × 350 and suddenly become more jagged after that dimension, and why the discontinuities are not visible in the cache-disabled test, in which DDR is used all the time (it must be reported, though, that the execution times are much higher in this case).
The tests also show that the speed-up factors obtained for the SGEMM function are similar to those for SDOT, that the NEON intrinsic solution is still much faster than the others, and that the various optimization levels lead to almost the same results, with minimal differences in speed but with differences in code size.
V. PROPOSED HW IMPLEMENTATIONS
As already hinted, we believe that it is important to also have HW profiling data in order to compare the two approaches. Therefore, we decided to implement on the PL an architecture that resembles the instructions executed by the NEON units in the SGEMM case.
We considered only the basic A × B multiplication, with α = 1 and β = 0 and without transposing the matrices.
The architectures implemented execute the following tasks:
• Retrieve the input matrices using an AXI DMA and a 32-bit AXI Stream interface.
• For every element of matrix C perform the dot-products between the rows of matrix A and columns of matrix B using different levels of parallelism.
• Output the resulting matrix using the same DMA but with a different AXI Stream interface.
The HW implementations were realized using the new Vivado HLS tool [13] and the matrix dimensions were fixed to 32 × 32, so that each matrix has 1024 floating-point elements. AXI DMA transfers with more than 1024 words cannot be sent in a single transaction and must therefore be split into multiple transfers, increasing the overall communication overhead. Moreover, even with 32 × 32 matrices, a fully parallel architecture (parallelism = 32) requires a considerable percentage of the available resources of the Z-7020 Zynq™ device, and the final design occupies almost the entire FPGA (if we take into account the resources needed by the AXI DMA and the interconnections).
Three different solutions were developed to measure the impact of parallelism on the achievable performance. The parallelism values chosen for the three solutions were, respectively, 4 (the same parallelism as the NEON architecture), 16 and 32 (the maximum parallelism directly achievable with 32 × 32 matrices).
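The role of the parallelism parameter can be illustrated with a plain C model of the 32 × 32 multiplication. This is a software sketch under our own assumptions, not the HLS source used in the paper: the function name `matmul_par` is hypothetical, matrices are stored row-major in flat arrays, and the DMA and AXI Stream interfaces are omitted.

```c
#include <assert.h>

#define DIM 32   /* matrix dimension fixed in the text */

/* C = A * B for DIM x DIM row-major matrices, processing `par` products per step.
 * `par` must divide DIM (4, 16 and 32 are the values used in the text). */
void matmul_par(const float *a, const float *b, float *c, int par)
{
    assert(par > 0 && DIM % par == 0);
    for (int i = 0; i < DIM; ++i) {
        for (int j = 0; j < DIM; ++j) {
            float acc = 0.0f;
            for (int k0 = 0; k0 < DIM; k0 += par) {
                /* In hardware these `par` multiply-adds would run concurrently,
                 * for example by unrolling this inner loop in an HLS tool. */
                float partial = 0.0f;
                for (int u = 0; u < par; ++u)
                    partial += a[i * DIM + k0 + u] * b[(k0 + u) * DIM + j];
                acc += partial;
            }
            c[i * DIM + j] = acc;
        }
    }
}
```

In software the three parallelism values compute the same products, grouped differently; in hardware, a larger `par` reduces the number of steps per output element from 8 (parallelism 4) to 2 (parallelism 16) or 1 (parallelism 32), at the cost of more arithmetic units.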
Regarding clock frequencies, the PS runs at 667 MHz while the HW synthesized in the PL can sustain a maximum clock frequency of 100 MHz.
VI. HW VS. SW ACCELERATION RESULTS
Fig. 3 shows the execution times in nanoseconds for the various implementations, both HW and SW.
As we can see from the histograms, having the same parallelism on both the PS and PL does not lead to the same results. Even though the custom IP core needs almost half as many clock cycles to output the results as the most optimized NEON intrinsic case, and almost one third as many as the automatic vectorization cases, the FPGA clock is much slower than the CPU clock. Therefore, the execution time is much higher in the PL than in the PS.
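The effect of the clock difference can be made explicit with a short calculation, sketched here in C. The function name `pl_to_ps_time_ratio` is hypothetical, and the cycle ratio of 0.5 used below is our reading of "almost half"; the measured values of Fig. 3 are not reproduced.

```c
#include <assert.h>

/* Ratio of PL execution time to PS execution time.
 * cycle_ratio = (PL clock cycles) / (PS clock cycles) for the same task. */
double pl_to_ps_time_ratio(double cycle_ratio, double f_ps_mhz, double f_pl_mhz)
{
    assert(cycle_ratio > 0.0 && f_ps_mhz > 0.0 && f_pl_mhz > 0.0);
    /* time = cycles / frequency, so the ratio of times is
     * (cycles_pl / f_pl) / (cycles_ps / f_ps). */
    return cycle_ratio * (f_ps_mhz / f_pl_mhz);
}
```

With a cycle ratio of 0.5 and the 667 MHz and 100 MHz clocks, the PL execution would take roughly 3.3 times as long as the NEON intrinsic one, which is consistent with the qualitative observation that halving the cycle count does not compensate for the slower clock.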
With a parallelism of 16 in the custom IP core we get almost the same execution time as in the compiler-optimized cases. We estimated that, in order to have the same execution time in both the NEON intrinsic case and the hardware IP case, a parallelism of at least 26 is needed for the hardware implementation. We based this estimate on the following formula:
Once the parallelism of the hardware IP core exceeds this theoretical cross-point the hardware architecture starts being faster than the PS implementation.
The speed-up factors for the three different hardware implementations are summarized in the table below (these factors are calculated in relation to the NEON intrinsic case):
[Table: Hardware accelerator parallelism vs. speed-up factor relative to the NEON intrinsic case]
As expected, this performance boost comes at a price. The number of resources needed to implement the parallelism-32 solution is indeed very high. The resource utilization estimated by Vivado HLS for this last hardware IP solution is about 75% of the available DSP blocks, 23% of the BlockRAMs, 45% of the LUTs and 12% of the flip-flops. Note that the additional resources needed to implement the AXI DMA and the AXI interconnections are not included in this estimate.
Implementing a single floating-point multiplier requires 3 DSP48E blocks, while a single floating-point adder requires 2 additional DSP48E blocks. Taking that into account, implementing floating-point operations on the Programmable Logic may not always be a good solution, also considering that a high degree of parallelism is required to match the NEON units' performance. In our case we managed to fit in the PL an accelerator faster than the NEON units without difficulty. This may not be the case for more complex designs to be implemented on the Z-7010 or Z-7020 and, during the partitioning phase of the project, the use of the NEON units for floating-point operations instead of a hardware accelerator in the PL is a design choice to consider.
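A rough DSP budget can be derived from the per-operator figures above, as in the following C sketch. Two points are our assumptions rather than statements from the paper: that a parallelism of P uses P multipliers and P adders, and that the Z-7020 provides 220 DSP slices in total. The helper names are hypothetical.

```c
#include <assert.h>

#define DSP_PER_FP_MULT  3     /* from the text */
#define DSP_PER_FP_ADD   2     /* from the text */
#define Z7020_DSP_SLICES 220   /* assumption: device total, not stated in the text */

/* Assumption: a parallelism of `par` uses `par` multipliers and `par` adders. */
int dsp_blocks_needed(int par)
{
    assert(par > 0);
    return par * (DSP_PER_FP_MULT + DSP_PER_FP_ADD);
}

/* Share of the assumed Z-7020 DSP budget, in percent. */
double dsp_utilization_pct(int par)
{
    return 100.0 * dsp_blocks_needed(par) / Z7020_DSP_SLICES;
}
```

Under these assumptions a parallelism of 32 would need 160 DSP blocks, or roughly 73% of the device, which is of the same order as the utilization of about 75% estimated by Vivado HLS.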
VII. HW VS. SW POWER MEASUREMENTS
The tests were performed targeting three different designs:
The board power consumption for the three designs is almost the same. This means that the Zynq™ device has a low impact on the overall power consumption, and the same holds for the architectures with different parallelism. Thus, in this case, to compare the different designs, it makes more sense to evaluate the impact on energy consumption rather than on power consumption.
Therefore, considering energy consumption, the faster the implementation, the shorter the execution time and the more energy can be saved. In other words, the fastest implementation is also the most power-efficient one, and this opens up a new frequency trade-off to take into account: raising the operating frequency may slightly increase the overall power consumption but will decrease the execution time, improving power efficiency. This aspect is especially important where many accelerators are employed simultaneously (e.g., data centers), because the energy saved by a single accelerator is multiplied by a large factor.
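The energy argument above reduces to multiplying power by execution time, as sketched below in C. The helper name `energy_mj` is hypothetical and the numbers in the note that follows are illustrative values, not measurements from the paper.

```c
#include <assert.h>

/* Energy in millijoules for a run drawing power_w watts for time_ms milliseconds. */
double energy_mj(double power_w, double time_ms)
{
    assert(power_w >= 0.0 && time_ms >= 0.0);
    return power_w * time_ms;   /* W * ms = mJ */
}
```

As an illustrative case, a design drawing 6.1 W for 1 ms would use 6.1 mJ, while a design drawing 6.0 W for 2 ms would use 12 mJ: when the power figures are nearly equal, the execution time dominates the energy.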
A final word has to be said about DRAM power consumption. It can be noted that power consumption for the two subsequent idle states (before and after the execution of the application) changes significantly. This is related to the DRAM usage as these memories are responsible for almost one-third of the total run-time and after-run power consumption. All the designs tested use the DDRs available on the Zedboard. These memories are activated and configured when launching the application and remain active even after the application is executed.
Unfortunately, at this moment it does not seem possible to dynamically deactivate the DRAM when it is not needed; the only way to do so is to disable it from the beginning, when designing the system within the Vivado IDE. This is very inefficient from an energy consumption point of view.
VIII. CONCLUSION
The aim of our work is to develop efficient accelerators for emerging SoC platforms.
In this preliminary work, we focused on benchmarking different acceleration solutions to understand the implementation tradeoffs necessary to develop a library of hardware IP blocks which are loadable on-demand on the PL.
For this purpose, we selected a few common computation-intensive routines and implemented them on the Zynq SoC using several paradigms for acceleration: from no acceleration, to acceleration in the NEON units, to customized IP blocks mapped in the PL.
The experimental results show that, for floating-point operations and low levels of parallelism in the PL, the NEON approach is more convenient. By increasing the level of parallelism, we can get better performance from the PL acceleration, at the expense of higher resource utilization.
Regarding power and energy, one of the main reasons to use hardware acceleration is to reduce energy consumption and improve power efficiency. During our tests we found that, since the power dissipation values are very similar for the various designs implemented and the Zynq has a low impact on the total board power consumption, the faster implementation is usually also the more power-efficient one, as it requires less energy to complete the execution.
