Given the wide use of convolutional neural networks (CNNs) in image recognition and their heavy computational requirements, this paper designs an accelerator based on SDSoC (software-defined system on chip). The main work is to modify the key parameters of the CNN training structure file and to select a suitable excitation function, ReLU (rectified linear unit), for training the convolutional neural network on a virtual machine. Finally, a CNN hardware accelerator with short image recognition time and high precision is realized. The experimental results show that the recognition accuracy is increased to about 78%, and that, compared with a traditional general-purpose CPU, the recognition time is shortened from 10 seconds to the millisecond level.
INTRODUCTION
Deep learning has been widely used in image processing, speech recognition, advertisement recommendation, and many other areas. As one kind of deep learning model, CNN is characterized by reducing the dimensionality of the images to be recognized and improving the training performance of the forward calculation [1]. On the one hand, CNN shares convolution kernels, which benefits the processing of high-dimensional data; on the other hand, feature values do not need to be selected manually, and the automatic feature classification is effective, which makes CNN widely used in language analysis and image recognition. At present, CNN is mainly used in the speech engine of the University of Science and Technology, the voice assistants of smartphones, and license plate identification in Baidu Map Street View [2].
At present, the main functions of CNN are implemented in software. Because the CNN structure performs many convolution operations, and a general-purpose CPU executes instructions serially, the CPU cannot process multiple convolutions in parallel. The field-programmable gate array (FPGA) processes data in parallel and therefore provides a research platform for the design of CNN accelerators [3]. Based on an analysis of the working principle of CNN, this paper customizes the network structure file and obtains the feature values of the convolution layers after more than 100,000 training iterations. The accelerator design is then completed on the SDSoC platform and compared with the CPU in terms of recognition time and resource consumption. Finally, future research is discussed.
CNN NETWORK STRUCTURE RECONSTRUCTION

Custom Network Structure
The traditional CNN consists of convolutional layers, pooling layers, fully connected layers, activation function layers, and Softmax and loss layers. This paper designs a custom network structure file that reduces hardware resource usage. The core idea is to remove the pooling layer and use two convolution layers. The base learning rate is adjusted over multiple training runs in order to improve the recognition accuracy. The improved CNN structure is shown in Figure 1.
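As a rough illustration of what removing the pooling layer implies, the sketch below computes the feature-map width after two stacked convolution layers. It assumes unpadded ("valid") convolutions, so that the only spatial reduction comes from the kernels themselves. The function names and any concrete sizes (for example a 32-pixel input with 5x5 kernels) are hypothetical and are not taken from the paper's network structure file.

```c
#include <stddef.h>

/* Output width of an unpadded ("valid") convolution.
 * Returns 0 when the kernel does not fit or the stride is invalid. */
static size_t conv_out_size(size_t in, size_t kernel, size_t stride)
{
    if (kernel > in || stride == 0) {
        return 0;
    }
    return (in - kernel) / stride + 1;
}

/* Width after two stacked convolution layers with no pooling in between.
 * k1 and k2 are the (hypothetical) kernel sizes of the two layers. */
static size_t two_conv_out_size(size_t in, size_t k1, size_t k2, size_t stride)
{
    return conv_out_size(conv_out_size(in, k1, stride), k2, stride);
}
```

Under these assumptions, a 32-pixel-wide input with two 5x5 kernels and stride 1 yields a 24-pixel-wide feature map, which is the width the fully connected layer would then have to handle.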
Training Feature Values
Activation function selection. In the selection of the activation function, the sigmoid function in the original network structure file was replaced with the ReLU excitation function. The formulas of these two common excitation functions are:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad \mathrm{ReLU}(x) = \max(0, x) \tag{1}$$
Comparing the complexity of these formulas, ReLU is clearly the easiest to implement in hardware logic, since it needs only a comparison, whereas the sigmoid requires an exponential and a division. Therefore, under the premise of ensuring accuracy, the ReLU excitation function is preferred. Assuming that the inputs of the neuron are $(x_1, x_2, x_3, \ldots, x_n)$, the weight corresponding to each input signal is $(w_1, w_2, w_3, \ldots, w_n)$, the bias value is $b$, and the excitation function is $f$, the calculation of each neuron is:
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
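The neuron calculation described above can be sketched in C as follows. This is a minimal software illustration, not the accelerator code of this paper. The function names `relu` and `neuron_forward` are hypothetical, and single-precision floats are an assumption, since the data type used on the hardware is not stated.

```c
#include <stddef.h>

/* ReLU: passes positive values through and clamps negative values to zero. */
static float relu(float x)
{
    return x > 0.0f ? x : 0.0f;
}

/* One neuron: y = f(sum_i w[i] * x[i] + b), with f = ReLU.
 * x holds the n inputs, w the n matching weights, b the bias. */
static float neuron_forward(const float *x, const float *w, size_t n, float b)
{
    float acc = b;
    for (size_t i = 0; i < n; ++i) {
        acc += w[i] * x[i];
    }
    return relu(acc);
}
```

The ReLU step is a single comparison, which is why it maps to far less logic than a sigmoid.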
Training feature value results. In this work, the network was trained on the virtual machine for a total of 120,000 iterations. By adjusting the base learning rate and the number of training iterations, the target recognition accuracy of 78% is reached. As shown in Fig. 1, when the base learning rate is 0.0005 and the number of training iterations is 16,000, the results of the training are tested using the test structure file.
DESIGN OF CNN ACCELERATOR BASED ON FPGA
This section first explains the overall logic design of the accelerator, and then verifies the timing of the CNN accelerator with the ModelSim simulation software.
SDSoC-Based Accelerator Design
Pragmas are at the core of SDSoC and play a vital role in the port constraints of the system. On the one hand, a pragma can constrain the attributes of the memory, which indirectly affects the connected port; on the other hand, it can directly constrain the type of the port. For the data transfer part, the size and type of the memory can be constrained. The data acceleration part needs to declare the acceleration library at the front, which greatly reduces the development time. In addition, high-level synthesis (HLS) pragma keywords are used for performance optimization. Figure 2 shows the design flow chart of the convolution accelerator of this paper.
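To make the role of the data pragmas more concrete, the following sketch annotates a small hardware function with SDSoC data pragmas. It is an illustrative assumption, not the interface used in this paper: the function `scale_hw`, the length `LEN`, and the particular choice of `access_pattern`, `copy`, and `mem_attribute` settings are hypothetical. A standard C compiler ignores the unknown pragmas, so the function can still be built and checked in software.

```c
#define LEN 1024  /* hypothetical transfer length, in elements */

/* Hypothetical hardware function. The pragmas placed directly before it
 * describe how its array arguments are moved between PS and PL:
 *  - access_pattern: both arrays are read/written strictly in order
 *  - copy:           LEN elements are transferred, starting at offset 0
 *  - mem_attribute:  the buffers are physically contiguous in memory */
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in[0:LEN], out[0:LEN])
#pragma SDS data mem_attribute(in:PHYSICAL_CONTIGUOUS, out:PHYSICAL_CONTIGUOUS)
void scale_hw(const float in[LEN], float out[LEN], float gain)
{
    for (int i = 0; i < LEN; ++i) {
        out[i] = in[i] * gain;  /* sequential access, matching the pragma */
    }
}
```

Declaring the access pattern and the memory attribute is what lets the tool infer the data mover and the port on its own, which corresponds to the indirect constraint mentioned above.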
The accelerator uses the PIPELINE, INLINE, UNROLL, and ARRAY_PARTITION keywords. PIPELINE enables parallel execution of functions or loop operations, thus reducing the initiation interval; INLINE inlines functions, which allows logic optimization across function boundaries and reduces call overhead; UNROLL expands for loops to improve parallelism between loop iterations; and ARRAY_PARTITION splits large arrays into multiple small arrays to increase parallel data access. Figure 4 shows the basic structure of the FPGA data acceleration. The interaction between the processing system (PS) and the programmable logic (PL) can effectively improve the data processing speed. Among the many ways of transferring data between the two sides, this paper uses the AXI bus, which greatly improves the data processing efficiency at both ends. The PS side is the data control end, and the PL side is the parallel data processing end. On the experimental platform of this paper, the DDR3 memory chip is connected to the PS side.
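The four HLS keywords can be illustrated on a small convolution kernel. The sketch below is a minimal example under assumed parameters, not the convolution accelerator of this paper: the sizes `IMG_SIZE` and `KSIZE`, the helper `mac`, and the function `conv2d_hw` are hypothetical, and the placement of the pragmas is only one plausible choice. Whether the requested initiation interval is actually met depends on the tool and the data type.

```c
#define IMG_SIZE 8                        /* hypothetical input width and height */
#define KSIZE    3                        /* hypothetical kernel width and height */
#define OUT_SIZE (IMG_SIZE - KSIZE + 1)   /* unpadded output width and height */

/* Multiply-accumulate helper, inlined into the caller to remove call overhead. */
static float mac(float acc, float pixel, float weight)
{
#pragma HLS INLINE
    return acc + pixel * weight;
}

void conv2d_hw(float img[IMG_SIZE][IMG_SIZE],
               float kernel[KSIZE][KSIZE],
               float out[OUT_SIZE][OUT_SIZE])
{
    /* Split the kernel into individual registers so that all weights
     * can be read in the same cycle. */
#pragma HLS ARRAY_PARTITION variable=kernel complete dim=0

    for (int r = 0; r < OUT_SIZE; ++r) {
        for (int c = 0; c < OUT_SIZE; ++c) {
            /* Request a new output pixel every clock cycle. */
#pragma HLS PIPELINE II=1
            float acc = 0.0f;
            for (int i = 0; i < KSIZE; ++i) {
                for (int j = 0; j < KSIZE; ++j) {
                    /* Expand the kernel loop into parallel multipliers. */
#pragma HLS UNROLL
                    acc = mac(acc, img[r + i][c + j], kernel[i][j]);
                }
            }
            out[r][c] = acc;
        }
    }
}
```

Because a standard C compiler ignores the HLS pragmas, the same source can be checked in software before synthesis, for example by convolving an all-ones image with an all-ones kernel and expecting every output to equal 9.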
Module Simulation
Buffer Module Simulation. The buffers commonly used in FPGAs are mainly RAM and FIFO. In this design, data transmission crosses clock domains, and a FIFO has a unique advantage in handling this situation: it only needs the enable signals to be controlled and, compared with RAM, requires no address signal control. The simulation results are shown in the following figure. As shown in the figure, an IP core is instantiated in this module, and a synchronous clock is used to process the signals, where sclk is a 50 MHz clock signal, data is the valid data in the module, wr_en and rd_en are the write and read enable signals, and full and empty are the full and empty flags, which indicate whether the data buffered by the FIFO is valid.
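The behaviour of the FIFO signals can be mimicked with a small software model. The sketch below is a behavioural illustration only, not the IP core used in the design: the depth `FIFO_DEPTH`, the 8-bit data width, and the names `sync_fifo` and `fifo_clock` are assumptions, and the model covers a single clock domain, with each call standing for one rising edge of sclk.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 16  /* hypothetical depth; the IP core depth is not stated */

typedef struct {
    uint8_t  mem[FIFO_DEPTH];
    unsigned wr_ptr;   /* next write position */
    unsigned rd_ptr;   /* next read position */
    unsigned count;    /* number of buffered entries */
} sync_fifo;

static void fifo_reset(sync_fifo *f) { f->wr_ptr = 0; f->rd_ptr = 0; f->count = 0; }
static bool fifo_full(const sync_fifo *f)  { return f->count == FIFO_DEPTH; }
static bool fifo_empty(const sync_fifo *f) { return f->count == 0; }

/* One clock edge. A write is accepted only if wr_en is set and the FIFO is
 * not full; a read is served only if rd_en is set and the FIFO is not empty.
 * Both flags are sampled before the edge. Returns true when *q holds new
 * read data. No address is supplied by the user, only the enables. */
static bool fifo_clock(sync_fifo *f, bool wr_en, uint8_t data, bool rd_en, uint8_t *q)
{
    bool do_write = wr_en && !fifo_full(f);
    bool do_read  = rd_en && !fifo_empty(f);

    if (do_read) {
        *q = f->mem[f->rd_ptr];
        f->rd_ptr = (f->rd_ptr + 1) % FIFO_DEPTH;
        f->count--;
    }
    if (do_write) {
        f->mem[f->wr_ptr] = data;
        f->wr_ptr = (f->wr_ptr + 1) % FIFO_DEPTH;
        f->count++;
    }
    return do_read;
}
```

The model makes the contrast with RAM explicit: the read and write pointers are internal state, so the surrounding logic drives only wr_en and rd_en and observes full and empty.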
EXPERIMENTAL ANALYSIS
Experimental Environment
The software platform of the experiment uses VirtualBox as the virtual machine with the Ubuntu Linux operating system. Training is performed on an Intel i5 series CPU with 8 GB of DDR3 memory, of which 4 GB is allocated to the host and 4 GB to the virtual machine. The hardware simulation tool is ModelSim, the hardware accelerator is a Xilinx ZYNQ-7000 series XC7Z020-1CLG400 acceleration board with 512 MB of memory, and the acceleration algorithm is written in C.
Experimental Results
Using the debug function of SDSoC, performance event tracing (TCF, Trace Function) analysis can be performed on the entire chip to monitor resource usage and guide targeted optimization of the acceleration. Figure 6 shows the resource usage of the FPGA. The operating frequency of the FPGA is 200 MHz. Since the network structure file is modified and the convolutional layers share the same acceleration function, the BRAM and DSP resources are reduced. Among them, BRAM uses only 62% of 62% DSP resources. The remaining resources are as shown in the figure.
Figure 6. Resource occupancy report.
FPGA Resource Usage Report
Using Xilinx's Vivado development tool, it can be seen that the entire design consumes about 2.5 W on chip, whereas the general-purpose CPU consumes about 60 W, a reduction of about 95%.
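The relative power saving follows directly from the two power figures quoted above. As a short worked calculation, with the symbols $P_{\mathrm{CPU}}$ and $P_{\mathrm{FPGA}}$ introduced here only for illustration:

```latex
\frac{P_{\mathrm{CPU}} - P_{\mathrm{FPGA}}}{P_{\mathrm{CPU}}}
  = \frac{60\,\mathrm{W} - 2.5\,\mathrm{W}}{60\,\mathrm{W}}
  \approx 0.958
```

Both inputs are approximate, so the result is best read as a saving of roughly 95%.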
CONCLUSIONS
The experimental data above show that the pipelined data processing enabled by the pragmas used in this paper greatly improves the convolution calculation speed of the convolutional layers of the CNN and fully exploits the parallel processing characteristics of the hardware. The custom network training structure file reduces the use of hardware logic resources, and the recognition accuracy is increased to about 78%. The accelerator has a short recognition time and high resource utilization, and it achieves substantial acceleration. However, because the overall design aims to reduce the logic resources of the hardware, some accuracy is lost, and reducing the number of convolutional layers also affects the accuracy. In future work, the recognition accuracy will be further improved. Finally, the authors thank the Technology Innovation Center of Agricultural Multi-Dimensional Sensor Information Perception, Heilongjiang Province, for supporting this paper.
