Abstract: FPGA devices have proven to be one of the most popular prototyping solutions. Nanoelectronics is one of the fields that can prototype its architectures on these devices. Many FPGA vendors have recently included embedded processors in their devices, such as Xilinx with ARM Cortex-A cores, together with programmable logic cells. These devices are known as Programmable Systems on Chip (PSoC). Their ARM cores (embedded in the processing system, or PS) communicate with the programmable logic cells (PL) using ARM-standard interface buses. ARM proposed the Advanced Microcontroller Bus Architecture (AMBA) as an open standard. Its third generation included the Advanced eXtensible Interface (AXI) to reach higher performance. In this paper we analyse the performance of exhaustive data transfers between PS and PL for a Xilinx Zynq FPGA in a real co-design scenario for a Convolutional Neural Network (CNN) accelerator. This CNN accelerator processes, in dedicated hardware, a stream of visual information from a neuromorphic visual sensor for classification. On the PS side, a Linux operating system collects visual events from the neuromorphic sensor into a normalized frame, transfers these frames to the accelerator of multi-layered CNNs, and reads back the results, using an AXI-DMA bus on a per-layer basis. As this kind of accelerator tries to process information as quickly as possible, data bandwidth becomes critical. Maintaining a well-balanced data throughput rate requires some considerations, such as data partitioning techniques to balance RX and TX transfers, or different transfer management techniques: polling versus a dedicated interrupt-based kernel-level driver. For sufficiently long packets, the kernel-level driver solution improves global computation timings in a CNN classification example. A kernel-level driver ensures safer solutions and enables OS task scheduling for better computation distribution.
I. INTRODUCTION
FPGA (Field Programmable Gate Array) devices with an on-chip processing system, known in the literature as SoC FPGA or PSoC (Programmable System on Chip), have recently emerged as potential solutions for compact processing applications. PSoCs combine the best of both worlds. They have a familiar processing system development interface for sequential algorithms or embedded OS applications and, at the same time, they provide an empty landscape for custom hardware development that enlarges the system's application set. PSoCs also offer a flexible, programmable alternative to sequential processing, since any hardware function can be implemented to augment the capabilities of the PS. In fact, due to the inherently parallel nature of the FPGA, multiple hardware blocks can operate simultaneously, either in parallel, when the logic is replicated, or as pipeline stages. These capabilities open up a wide range of possibilities for applications that can be deployed in these systems. PSoCs can be found in different applications where the main and lighter tasks are processed by the PS, while harder computational tasks are deployed on the PL. Some examples can be found in memristor simulations, where ASIC manufacturing requires multiple prior simulations to ensure the viability of the proposed architecture and to guarantee the correctness of the design and its success [1]. Other applications include automotive [2], [3], image and video processing [4], medical [5], and high performance computing [6], [7], [8], [9], [10]. In this paper, a performance evaluation of Xilinx PSoC memory transfers is presented and tested for a CNN accelerator application [11]. The evaluation compares a user-level driver, with several improvements, against a kernel-level driver.
This paper is organized as follows. Section II briefly describes the Xilinx Zynq PSoC architecture and enumerates the interfaces between the PS (Processing System) and the PL (Programmable Logic) in the Zynq. Section III explains the AXI DMA transfer flow developed for both the user-level and the kernel-level drivers, while Section IV presents the transfer timing results of each scenario. Finally, Section V presents the conclusions.
II. XILINX PSOC PLATFORM
Zynq chips from Xilinx are PSoC architectures, which contain an ARM Cortex-A family processor and reprogrammable logic (FPGA) in the same chip. These chips represent a new co-design solution where the embedded OS (Linux) on the ARM cores executes software tasks (e.g. data normalization, data collection from sensors) and the reconfigurable logic implements a design that accelerates a specific application.
Interconnection between the PL and the ARM processor is done through the AXI bus. This interconnection can be managed through an IP block that configures the different interfaces of the ARM core (I2C, SPI, ...) and the interfaces from the PL, such as AXI and clock speed (Fig. 1). The current version is AXI4, which is part of the ARM AMBA 4 open standard [13]. The AMBA standard was originally developed by ARM for microcontrollers, but it was later extended for SoCs, including PSoCs, and it is an optimal interconnection technology between PS and PL. There are three different types of AXI4, each of which represents a different bus protocol. AXI4 [12] is oriented to memory-mapped links and provides high performance. AXI4-Lite [12] is a simplified memory-mapped link supporting only one data transfer per transaction (no bursts). AXI4-Stream [13] is oriented to high data flow applications with DMA support. Unlike the previous ones, this option has no address channel: data are streamed over a single channel using a simple valid/ready handshake.
[...] is able to manage all the power supplies needed by the MMP (from 1 V to 12 V), the JTAG port over UART and several parallel interfaces to neuromorphic chips over the CAVIAR and ROME parallel AER connectors [14]. The DockSoC can act as a daughter board for the AERNode [15] platform to expand the connectivity to other PSoC platforms and/or to support the connectivity to other neuromorphic systems. Figure 2 shows a picture of the setup used, with the PSoC platform, the DockSoC baseboard and a USB neuromorphic retina called DAVIS. The DAVIS [16] is a dynamic vision sensor that measures luminosity changes independently per pixel and sends out events to signal which pixel has detected a change over a configurable threshold. By collecting a fixed number of events from this sensor, a histogram of those events can be used as a frame to be computed by the CNN accelerator running on the platform. This reduces the frame time, compared with conventional CCDs, to 2 ms/frame.
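The event collection described above can be sketched as follows. This is only an illustrative sketch: the function name events_to_frame, the (x, y) event format and the normalization to a 0..255 range are assumptions made for the example, not details taken from the actual software.

```python
from typing import Iterable, List, Tuple


def events_to_frame(
    events: Iterable[Tuple[int, int]],
    width: int,
    height: int,
    max_value: int = 255,
) -> List[List[int]]:
    """Accumulate (x, y) events into a per-pixel histogram and normalize it.

    Hypothetical helper: the normalization (scale so that the busiest pixel
    maps to max_value) is an assumption for illustration only.
    """
    counts = [[0] * width for _ in range(height)]
    for x, y in events:
        counts[y][x] += 1  # one more event seen at this pixel

    peak = max(max(row) for row in counts)
    if peak == 0:
        return counts  # no events: return an all-zero frame

    # Integer scaling so the result fits a fixed-width pixel format.
    return [[(c * max_value) // peak for c in row] for row in counts]


if __name__ == "__main__":
    sample_events = [(0, 0), (0, 0), (1, 1)]
    print(events_to_frame(sample_events, width=2, height=2))
```

Collecting a fixed number of events per frame, as the text describes, would correspond to calling this function once per batch of events.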
III. AXI DMA COMMUNICATION
The PL can be connected to the ARM processors through multiple interfaces, as mentioned before. However, the AXI4-Stream modality, commonly called AXI-DMA, offers the best performance for large data transactions. AXI-DMA consists of two different channels: Memory Mapped to Stream (MM2S) and Stream to Memory Mapped (S2MM). MM2S reads from DDR memory (off-chip) and transmits data to the PL, while S2MM writes data from the PL to DDR memory. The DMA architecture presented in this paper contains two modules that were designed to adapt the data flow of the S2MM and MM2S interfaces to and from the CNN accelerator implemented in the PL, called NullHop [7].
NullHop is a hardware accelerator designed for the execution of multi-layered CNNs in deep-learning classification applications. It resides in the PL and it needs to receive both the visual input (the feature maps for a particular layer, or a portion of them) and the parameters (convolution kernels) from the PS in order to calculate the results (output feature maps). It was designed with 128 MAC blocks that work in a streamed way. Once the accelerator has received the parameters, the visual input is streamed in, taking advantage of DMA. After a couple of rows are received, the MACs start to operate and produce a streamed output, which is sent back to the PS.
To extract the maximum performance from our PSoC system, the data flow in the application must be properly coordinated. When an OS is managing the PSoC, there are two different memory spaces: the virtual one, where the user application works, and the physical one, which is managed by the DMA controller and is therefore visible to the hardware implemented in the PL. The API and/or driver performs the transfers between both spaces. Figure 3 shows the memory hierarchy from the user application to the CNN accelerator.
On embedded Linux, there are two common ways to communicate with devices: (1) user-level: the function mmap() is used to map a view of the device's physical address space into the virtual address space of the process. This function is called directly by the user application, at user level, and the DMA transfers can be configured in a polling scheme, where the user application is frequently blocked, waiting for the transfer to complete before processing the data; or (2) kernel-level: a piece of software running at a higher privilege level of the OS, with interrupt support, frees the user application from blocking states until the data are ready, allowing the execution of other needed tasks. Furthermore, the kernel-level approach ensures the integrity of the software by avoiding possible misuse of physical address spaces reserved for other processes running in the OS.
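A minimal sketch of the user-level polling scheme is given below. It is an illustration under stated assumptions, not the driver used in this work: the register offsets and bit masks follow the direct register mode of the Xilinx AXI DMA core as we understand it and should be checked against the product guide for the core version in use, and the register window is simulated with a bytearray so that the sketch runs without hardware. On a real system the same helpers could operate on a window obtained with Python's mmap.mmap() over /dev/mem at the base address of the DMA core, which is platform specific.

```python
import struct

# Assumed register map (Xilinx AXI DMA, direct register mode); verify
# against the documentation of the core actually instantiated in the PL.
MM2S_DMACR = 0x00   # MM2S control register
MM2S_DMASR = 0x04   # MM2S status register
MM2S_SA = 0x18      # MM2S source (physical) address
MM2S_LENGTH = 0x28  # MM2S transfer length in bytes
DMACR_RUN = 0x1     # run/stop bit in the control register
DMASR_IDLE = 0x2    # idle bit in the status register


def write_reg(regs, offset: int, value: int) -> None:
    """Write a 32-bit little-endian register in the mapped window."""
    struct.pack_into("<I", regs, offset, value)


def read_reg(regs, offset: int) -> int:
    """Read a 32-bit little-endian register from the mapped window."""
    return struct.unpack_from("<I", regs, offset)[0]


def start_mm2s(regs, phys_addr: int, length: int) -> None:
    """Program one MM2S transfer: set run bit, source address, then length."""
    write_reg(regs, MM2S_DMACR, read_reg(regs, MM2S_DMACR) | DMACR_RUN)
    write_reg(regs, MM2S_SA, phys_addr)
    write_reg(regs, MM2S_LENGTH, length)  # on hardware this write starts the transfer


def poll_until_idle(regs, max_polls: int = 1_000_000) -> int:
    """Busy-wait on the status register; return the number of polls needed."""
    for polls in range(max_polls):
        if read_reg(regs, MM2S_DMASR) & DMASR_IDLE:
            return polls
    raise TimeoutError("DMA channel did not become idle")


if __name__ == "__main__":
    fake_regs = bytearray(0x60)                    # stand-in for the mmap'ed window
    start_mm2s(fake_regs, 0x10000000, 4096)        # hypothetical physical address
    write_reg(fake_regs, MM2S_DMASR, DMASR_IDLE)   # simulate hardware completion
    print("polls:", poll_until_idle(fake_regs))
```

The busy-wait loop in poll_until_idle is what keeps the user application blocked in the polling scheme; a kernel-level driver replaces it with an interrupt handler.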
In this work, a performance comparison between these two communication schemes is presented. Furthermore, two different operating modes for the user-level driver were included in the study: a completely polling-based solution, which would have the lowest latencies between DMA transfers, and a scheduled solution, where waiting on DMA transfers does not continuously block the system.
A. User-level
We compared two read/write buffer implementations: single and double buffer. The first one uses a single buffer as the only channel for data transfers between virtual and physical memory. The double buffer implementation reserves two buffers in memory for virtual-to-physical transfers: while one holds data ready to be sent to the PL, the other one is used to prepare data for the next transmission. This second implementation reduces overhead latencies at OS level. Apart from the buffer implementation, two user-level driver operating modes were implemented: Unique and Blocks. Unique mode sends all the data to the buffer at once, without any kind of partitioning. Blocks mode, on the other hand, divides data into smaller chunks to take better advantage of the double buffering. Furthermore, two user-level versions were compared: one completely based on polling, and a second one closer to the kernel-level scenario explained in the following subsection, where a scheduler manages the different DMA requests to avoid deadlocked waits.
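The Blocks mode combined with double buffering can be sketched as follows. The names split_blocks and send_double_buffered, and the send callback standing in for a DMA submission, are hypothetical. The sketch runs sequentially and only shows the partitioning and the alternation between the two buffers; in a real driver the filling of one buffer would overlap with the DMA transfer of the other.

```python
from typing import Callable, List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Blocks mode: divide the payload into chunks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def send_double_buffered(
    data: bytes,
    block_size: int,
    send: Callable[[bytes], None],
) -> List[int]:
    """Alternate two staging buffers while sending the data block by block.

    Returns the sequence of buffer slots used, to make the ping-pong visible.
    """
    buffers = [bytearray(block_size), bytearray(block_size)]
    slots_used = []
    for index, block in enumerate(split_blocks(data, block_size)):
        slot = index % 2                       # ping-pong between buffer 0 and 1
        buffers[slot][:len(block)] = block     # stage the block (virtual-to-physical copy)
        send(bytes(buffers[slot][:len(block)]))  # stand-in for starting a DMA transfer
        slots_used.append(slot)
    return slots_used


if __name__ == "__main__":
    sent: List[bytes] = []
    print(send_double_buffered(b"abcdefghij", 4, sent.append), sent)
```

Unique mode corresponds to calling send once with the whole payload, that is, using a block size equal to the data length.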
B. Kernel-level
In order to give the OS greater flexibility to attend to other tasks in a realistic scenario, we implemented a kernel-level driver that uses interrupts to manage the configuration of new DMA transfers when they are needed, allowing the PS to work on other tasks in the meantime. In this case, the user-level software tells the kernel-level driver where all the data are placed; then, the driver moves these data from virtual to physical space and configures the needed DMA transfers. We used the AXI-DMA driver provided by Xilinx, which supports AXI-Stream DMA transfers either with the needed length or divided into small pieces that are queued as consecutive transfers (known as Scatter-Gather mode). To use the Xilinx AXI-DMA driver, a kernel-level API was developed to adapt the driver to our needs.
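The idea of dividing one long transfer into small pieces queued as consecutive transfers can be illustrated with a simple descriptor chain. This is a conceptual sketch only: the Descriptor fields and the build_chain helper are hypothetical and do not reproduce the descriptor layout or the API of the Xilinx driver.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    """One queued piece of a transfer (hypothetical, simplified layout)."""
    address: int                # physical start address of this piece
    length: int                 # number of bytes in this piece
    next_index: Optional[int]   # index of the next descriptor, None for the last


def build_chain(base_address: int, total_length: int, max_piece: int) -> List[Descriptor]:
    """Split [base_address, base_address + total_length) into linked pieces."""
    if max_piece <= 0:
        raise ValueError("max_piece must be positive")
    chain: List[Descriptor] = []
    offset = 0
    while offset < total_length:
        length = min(max_piece, total_length - offset)
        chain.append(Descriptor(base_address + offset, length, None))
        offset += length
    # Link every descriptor to its successor; the last one keeps next_index None.
    for i in range(len(chain) - 1):
        chain[i].next_index = i + 1
    return chain


if __name__ == "__main__":
    for descriptor in build_chain(0x1000, 10, 4):
        print(descriptor)
```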
IV. RESULTS
We tested the PSoC under two different scenarios: (1) loop-back hardware in the PL that takes data from MM2S and streams it back to the S2MM interface of the DMA controller; and (2) a CNN execution using the NullHop accelerator in the PL, executing the RoShamBo CNN [7]. Figures 4 and 5 show the results of the first scenario. They present the evolution of TX and RX transfer times for buffer sizes increasing from 8 bytes to 6 Mbytes, for the user-level driver with polling, the scheduled user-level driver and the interrupt-based kernel-level driver.
For the loop-back streaming, TX and RX buffers may be full at the same time, so requests for reading RX buffers may occur at the same time as new TX requests are produced. Since DDR memory cannot serve read and write operations at the same time, the bandwidth balance between RX and TX transfers is important in order to avoid blocking states of the system; e.g. a sufficiently long TX transfer can fill up the RX hardware buffer and stop the TX transfer, blocking the system if RX and TX transfers are not properly managed. In these figures, it can be seen that TX transfers have slightly higher priority than RX transfers, obtaining smaller latencies. The kernel-level driver approach, due to the bigger software overhead of the Xilinx AXI-DMA driver and the API, produces bigger latencies than the user-level approach for smaller data lengths, but it increases the performance for bigger data lengths. The user-level solution with polling and without scheduling obtains slightly better results, at the risk of blocking the system while the transfers are in progress.
For the second scenario, we set up the RoShamBo CNN execution in the MMP platform OS in the same way as described in [7], modifying the software to use one of the three modes to control the memory transfers between virtual and physical memory and to manage the DMA transfers as described above. In this test, we used the single-buffer configuration and the Unique mode. Table I shows the timings obtained for this case. The lowest latencies are obtained for the user-level mode with polling. This is possible with this relatively small CNN because transfer lengths are not long enough to block the system. In [7], bigger CNNs were tested, such as VGG19, for which this mode cannot be used because it blocks the system.
With the second mode, without a kernel-level driver but introducing a scheduler in the OS to avoid blocking the system, the latencies increase by less than 2 ns per byte for TX and by less than 150 ns for RX. When the kernel driver is used, the latencies increase by around 6 ns/byte for TX, but they decrease with respect to the use of the scheduler, being less than 100 ns slower than the user-level polling mode. Regarding the whole frame computation time, which requires the execution of 5 convolution layers in NullHop, and therefore sending and receiving DMA transfers for each layer, the latencies are biggest for the kernel-level driver, followed by the scheduled user-level mode and then the polling user-level mode. This behavior is expected, since transfer lengths for the RoShamBo CNN are in the order of 100 Kbytes, where the kernel-level driver does not yet obtain its best results, as depicted in Figures 4 and 5.
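One simple way to reason about this trade-off, offered here as an illustrative assumption and not as a model fitted to the measurements, is to approximate the transfer time of each driver for a transfer of n bytes as a fixed software overhead plus a per-byte cost. The symbols a_u, b_u, a_k and b_k are hypothetical parameters introduced only for this sketch.

```latex
\begin{align}
T_{\mathrm{user}}(n)   &= a_u + b_u\, n, \\
T_{\mathrm{kernel}}(n) &= a_k + b_k\, n, \qquad a_k > a_u,\quad b_k < b_u, \\
T_{\mathrm{kernel}}(n) < T_{\mathrm{user}}(n)
  &\iff n > n^{*} = \frac{a_k - a_u}{b_u - b_k}.
\end{align}
```

Under these assumptions, the larger fixed overhead of the kernel-level driver dominates for short transfers, and the kernel-level driver only pays off for transfer lengths above the crossover length n*.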
V. CONCLUSIONS
This paper presents and evaluates different software-level implementations of data movements between the virtual memory space of an OS at user level and the physical memory space at kernel level, for DMA transactions between the PS and the PL of a Xilinx Zynq PSoC for CNN executions. The implementations range from a polling solution at the user-level privilege of the OS, with less memory protection, through an intermediate user-level solution using a scheduler, to a kernel-level driver with interrupts, which offers the highest protection. They were evaluated in two different scenarios: a real one, the execution of a CNN that plays RoShamBo on the NullHop CNN hardware accelerator, and a synthetic one, used to extract the performance characteristics of the different implementations.
User-level solutions give better latencies for data transfer lengths below 1 Mbyte, but they lack flexibility for multithreaded programs due to their intensive use of polling. Their maximum supported transfer length is 8 Mbytes (AXI4-Stream limit), although for long transfers the performance decreases due to long polling stages.
The kernel-level solution, tested for the worst possible case (single buffer scheme and unique data transfers), obtains similar latencies for bigger data transfer lengths. For the RoShamBo test, since transfer lengths are in the order of 100 Kbytes, the user-level polling solution performs better, as it has a smaller software overhead.
VI. ACKNOWLEDGMENT
This work was partially supported by the NPP project funded by SAIT (2015-2018) and by the Spanish government grant (with support from the European Regional Development Fund) COFNET (TEC2016-77785-P). The work of R. Tapiador has been supported by a Formación de Personal Investigador Scholarship from the University of Seville.
