ABSTRACT This paper presents a 3-D convolution operator with enhanced efficiency, based on an optimal field-programmable gate array (FPGA) accelerator platform. The proposed system takes advantage of intermediate data delay lines, implemented in the FPGA, to avoid repeated loading of the input feature maps. The 3-D convolution accelerator performs 268.07 giga operations per second (GOPS) at a 100-MHz operating frequency, with 330-mW power consumption. We experimentally demonstrate the enhanced efficiency of the proposed convolution accelerator in comparison with conventional technologies. The proposed 3-D convolution accelerator may find applications in neural networks and video processing.
I. INTRODUCTION
Over the past few decades, Convolutional Neural Networks (CNNs), an advanced class of machine learning algorithms, have found numerous applications, including computer vision and speech recognition [1]-[3]. Despite the growing use of two-dimensional (2D) CNNs, their implementation on CPUs remains inefficient [4]. To overcome this issue, various accelerator technologies have recently been proposed, including the Graphics Processing Unit (GPU), the Application Specific Integrated Circuit (ASIC) and the Field Programmable Gate Array (FPGA).
The GPU is one of the best platforms for accelerating CNNs, especially when training is involved [3], [5], [6]. A single GPU delivers outstanding performance, but its high power consumption hinders its use in embedded systems. Moreover, GPUs are restricted by their inflexible parallel calculation units and require the calculation models to be adapted to them. Another technology, the ASIC chip, is well known for its superior performance and high energy efficiency [7]-[9], but it suffers from low flexibility and high development cost.
The FPGA is a promising CNN acceleration platform that shows superior performance over the aforementioned technologies. Recently, FPGA-based accelerators have attracted considerable attention, since they provide high performance, good energy efficiency, a short development cycle and reconfigurability. Previous works on FPGA-based 2D CNN accelerators have shown increasing throughputs [4], [10]-[12]. Peeman et al. focused on maximizing the reuse of on-chip data [10]. Concentrating on the optimization of both resources and communication bandwidth, Zhang et al. presented an implementation that achieved a peak performance of 61.62 Giga Floating-point Operations Per Second (GFLOPS) at a 100-MHz operating frequency on a Virtex-7 FPGA [4]. In [11], the authors leveraged all sources of parallelism in CNNs, and their work reached 84.2 GFLOPS on a Virtex-7 FPGA. Ovtcharov et al. reported specialized hardware based on Xeon and FPGA co-processing for data centers, reaching a speed of 233 images per second with a Catapult server and an Arria 10 GX1150 FPGA [12]. As for 2D convolution accelerators, the authors of [13] presented a MultiWindow Partial Buffering (MWPB) scheme for 2D convolution operators and achieved a speed of one pixel per clock. Carlo et al. dramatically decreased the area required by the MWPB scheme for better integration with embedded systems [14]. The work in [15] reduced the look-up table resource usage in a systolic array architecture. References [16] and [17] separated the 2D convolution to reduce the computational complexity; a speed of one pixel per clock is achieved in [16], and a speed of 194 frames per second (f/s) is achieved in [17]. However, FPGAs accelerate pre-trained 2D CNN models, in which the convolution operations may occupy over ninety percent of the computation time [18].
Exploiting the aforementioned acceleration approaches, 2D CNNs have been widely developed. Meanwhile, the 3D CNN calculation model has emerged and found various applications, including target tracking and human action recognition [19]. The 3D convolution captures the information of multiple contiguous frames in a single operation. However, this represents a greater computational burden, especially for embedded real-time video processing. As a result, the main challenge of such a system is to reach acceptable performance and power consumption. A 3D convolution based on an FPGA accelerator may serve as an alternative technique to overcome these issues; however, few studies have been reported in this area.
In this paper, a 3D convolution operator based on an optimal FPGA accelerator is presented. The optimal operation is achieved by taking advantage of Intermediate Data Delay Lines (IDDL) to avoid repeated loading of pixels. Moreover, the proposed system exhibits high calculation performance and low power consumption due to hardware tailored to the 3D convolution. Specifically, we make the following contributions. First, we propose a 3D convolution accelerator design with IDDLs and tailored hardware. Second, as a case study, we implement a 3D convolution accelerator that achieves a performance of 268.07 GOPS, which is 2.65× that of the fastest commercially available GPU.
The paper is organized as follows. Section II describes the operation principle and properties of the proposed 3D convolution. Section III presents the optimization and implementation of the FPGA accelerator. The experimental demonstration of the proposed system will be given in Section IV. Finally, Section V concludes the paper.
II. OPERATION PRINCIPLE
Figures 1(a) and (b) illustrate the generic representation of the 2D and 3D convolutions, respectively. To perform a convolution, the convolution kernel slides over the domains of the input feature maps and generates the output feature maps. The 2D convolution extracts independent features from a sequence of images, with several 2D convolution kernels sliding along the M and N axes as shown in Fig. 1(a). In contrast, the 3D convolution captures both spatial and temporal information using a 3D convolution kernel sliding along the M, N and S axes as shown in Fig. 1(b). In 3D convolution, for S given input feature maps of size M × N, pixel cubes are extracted from the input feature maps and then convolved with the 3D convolution kernel. The pixel value at the position (m, n, s) on the output feature map is given by [19]
o_{mns} = \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{r=0}^{R-1} w_{pqr} \, i_{(m+p)(n+q)(s+r)}
where P, Q and R represent the height, width and length of the kernel, respectively. Here, w_{pqr} is the value of the kernel at the position (p, q, r), and i_{(m+p)(n+q)(s+r)} is the pixel value of the (s+r)-th input feature map at the position (m+p, n+q); for p, q and r equal to zero, this is i_{mns}, the pixel value of the s-th input feature map at the position (m, n). Three convolution styles, valid, same and full, operate based on different boundary processing methods [20]. Here, we use a valid 3D convolution operation, in which the convolution kernel is only allowed to visit the domains where it is contained entirely within the input feature maps. For this 3D convolution style, the number of output feature maps is (S − R + 1), each with the size of (M − P + 1) × (N − Q + 1). Figure 2 presents the pseudo code of the valid 3D convolution.
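The valid 3D convolution described above can be sketched in plain Python as follows. This is an illustrative reference loop nest, not a reproduction of the pseudo code in Fig. 2; the function name `conv3d_valid` and the nested-list data layout (`maps[s][m][n]`, `kernel[r][p][q]`) are assumptions made for this example.

```python
def conv3d_valid(maps, kernel):
    """Valid 3D convolution (no kernel flipping, as in CNN practice).

    maps:   S feature maps of size M x N, indexed maps[s][m][n]
    kernel: R slices of size P x Q, indexed kernel[r][p][q]
    returns (S-R+1) output maps of size (M-P+1) x (N-Q+1)
    """
    S, M, N = len(maps), len(maps[0]), len(maps[0][0])
    R, P, Q = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = [[[0] * (N - Q + 1) for _ in range(M - P + 1)]
           for _ in range(S - R + 1)]
    for s in range(S - R + 1):
        for m in range(M - P + 1):
            for n in range(N - Q + 1):
                acc = 0
                # Accumulate the P*Q*R products of one pixel cube.
                for r in range(R):
                    for p in range(P):
                        for q in range(Q):
                            acc += kernel[r][p][q] * maps[s + r][m + p][n + q]
                out[s][m][n] = acc
    return out


# Toy input (hypothetical sizes): three 3x3 maps of ones, 2x2x2 kernel of ones.
ones_maps = [[[1] * 3 for _ in range(3)] for _ in range(3)]
ones_kernel = [[[1] * 2 for _ in range(2)] for _ in range(2)]
result = conv3d_valid(ones_maps, ones_kernel)
```

With all-ones inputs, every output pixel equals P × Q × R, which makes the output dimensions easy to check by eye.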
Next, we evaluate the computational complexity of the 3D convolution, represented by the number of multiplication and addition operations. To form a single output pixel using a P × Q × R kernel, we need P × Q × R multiplications, (P × Q × R − 1) additions and P × Q × R input pixel loads. For valid convolution, the total number of operations follows by multiplying these counts by the number of output pixels. Table 1 presents the computational complexity of the three elemental operations in a 3D convolution. As a result, to convolve even simple feature maps with a small kernel, over one million operations are required. Therefore, to achieve real-time performance, i.e. thirty feature maps per second, a computational power of several GOPS is required. In fact, the presence of an extra, temporal, dimension in 3D convolution introduces a massive number of operations.
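The per-pixel counts above can be turned into totals with a short calculation. The sketch below assumes that the totals are simply the per-pixel counts multiplied by the number of valid output pixels, (M − P + 1)(N − Q + 1)(S − R + 1); Table 1 is not reproduced here, and the 64 × 64 × 10 input with a 3 × 3 × 3 kernel is a hypothetical example, not a configuration from the paper.

```python
def conv3d_op_counts(M, N, S, P, Q, R):
    """Total elemental operations of a valid 3D convolution."""
    outputs = (M - P + 1) * (N - Q + 1) * (S - R + 1)  # output pixels
    per_pixel = P * Q * R                              # products per pixel
    return {
        "outputs": outputs,
        "multiplications": outputs * per_pixel,
        "additions": outputs * (per_pixel - 1),
        "loads": outputs * per_pixel,
    }


# Hypothetical example: ten 64x64 feature maps, 3x3x3 kernel.
counts = conv3d_op_counts(M=64, N=64, S=10, P=3, Q=3, R=3)
total_ops = counts["multiplications"] + counts["additions"]
```

Even for this small case the multiplications and additions together exceed one and a half million, in line with the order of magnitude stated in the text.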
III. OPTIMIZATION AND IMPLEMENTATION
The 3D convolution operation in Fig. 1(b) may be represented as the summation of R 2D convolution operations. In an FPGA accelerator, the 3D convolution operation may therefore be realized with R 2D convolvers, identical to those of 2D convolution operators, plus an adder tree. Figure 3 shows the operation principle of this basic 3D convolution. Due to the limited resources of the FPGA-based 2D convolvers, the pixels cannot be permanently stored in the FPGA storage resources. As a consequence, in a 3D convolution operation formed by R 2D convolvers, the input feature maps must be loaded repeatedly. This repeated loading is time-consuming and strongly degrades the acceleration performance, so the number of data loads must be reduced. Figure 4 depicts the general schematic of the 3D convolution using Intermediate Data Delay Lines (IDDL). The IDDLs in an FPGA temporarily store the intermediate data, which avoids repeated loading of the feature maps. As a particular case, a kernel with length R = 3 is shown in Fig. 4, where w_{pq1}, w_{pq2} and w_{pq3} are denoted as A, B and C. Three input channels, CH1, CH2 and CH3, are interconnected with the three 2D kernels A, B and C. Nine 2D convolvers are required, each labelled X_i, where X ∈ {A, B, C} denotes the kernel convolved with the input feature maps and the subscript i = 1, 2, 3 denotes the channel through which the input feature map is loaded. Once the three input feature maps are loaded, each 2D convolver provides an output, so that nine 2D convolution outputs are generated. We define x_i^j as the 2D convolution output of the j-th feature map convolved with X_i, where j = i + 3n, S − 3 < 3n < S, ∀n ∈ N and x ∈ {a, b, c}.
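The decomposition of a 3D convolution into R 2D convolutions followed by an adder can be illustrated in software. The sketch below models the basic (non-optimized) scheme, in which every output feature map re-reads R input maps; the function names `conv2d_valid` and `conv3d_as_2d_sum` are hypothetical, and the code is a behavioural model, not a description of the hardware in Fig. 3.

```python
def conv2d_valid(img, k):
    """Valid 2D convolution of one feature map with one kernel slice."""
    M, N = len(img), len(img[0])
    P, Q = len(k), len(k[0])
    return [[sum(k[p][q] * img[m + p][n + q]
                 for p in range(P) for q in range(Q))
             for n in range(N - Q + 1)]
            for m in range(M - P + 1)]


def conv3d_as_2d_sum(maps, kernel):
    """3D convolution as the sum of R 2D convolutions (basic scheme)."""
    R = len(kernel)
    out = []
    for s in range(len(maps) - R + 1):
        # R 2D convolvers, one per kernel slice; maps[s + r] is re-read
        # for every output map that depends on it.
        partial = [conv2d_valid(maps[s + r], kernel[r]) for r in range(R)]
        rows, cols = len(partial[0]), len(partial[0][0])
        # Adder stage: sum the R partial results pixel by pixel.
        out.append([[sum(partial[r][m][n] for r in range(R))
                     for n in range(cols)]
                    for m in range(rows)])
    return out


# Toy input: three 2x2 maps filled with 1, 2, 3; kernel slices 1x1.
toy_maps = [[[s + 1] * 2 for _ in range(2)] for s in range(3)]
toy_kernel = [[[1]], [[10]]]  # R = 2
toy_out = conv3d_as_2d_sum(toy_maps, toy_kernel)
```

In this toy case the first output map is 1·1 + 10·2 = 21 everywhere and the second is 1·2 + 10·3 = 32, and the middle input map is read twice, which is the repetition the IDDLs are meant to remove.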
A. OPTIMIZATION OF 3D CONVOLUTION OPERATION
The optimized operation in Fig. 4 significantly reduces the number of data loads, since each input pixel only needs to be loaded once. Table 2 presents the comparison between the loading times in the basic and optimized 3D convolution operations. If R equals three, the basic operation requires almost three times as many loads as the optimized one. In the basic operation, the output feature maps are tied to the individual loading passes, whereas the optimized operation provides all corresponding output feature maps. Figure 6 illustrates the data flowchart of the optimized 3D convolution accelerator based on the FPGA. We implemented the proposed accelerator in a ZYNQ chip, which is a programmable SoC from Xilinx. The number of transmission channels between the Processing System (PS) and the Programmable Logic (PL) is equal to the length of the convolution kernel, which is considered to be either 3 or 4. Prior to the data processing, all input feature maps are stored in an external memory by the PS. To calculate the 3D convolution, three or four input feature maps are loaded to the PL at a time, and the pixels of the input feature maps are loaded in raster-scan order. The 3D convolution operation is performed in the PL, which consists of three parts: Direct Memory Access (DMA), input-output buffers and the 3D convolution architecture. As soon as the DMA module receives the data streams of the input feature maps, the input buffers temporarily store them and then transmit them to the 3D convolution architecture. Next, the architecture provides the output feature maps by convolving the input data streams. Finally, the output feature maps are temporarily stored in the output buffers and transmitted to the PS through the DMA modules. Figure 7 presents the optimized 3D convolution architecture. The proposed 3D convolution architecture is composed of several segments, including the kernel caches, an array of 2D convolvers, IDDLs, adders, a data output controller and a data interconnect bus.
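The "almost three times" relation between the two loading schemes can be checked with a simple count. The sketch below rests on an assumption that is not stated explicitly in the text: in the basic scheme every one of the (S − R + 1) output maps reloads its R input maps, while the optimized scheme loads each of the S input maps exactly once. Table 2 is not reproduced here, and the function name is hypothetical.

```python
def feature_map_loads(S, R):
    """Feature-map loads for the basic and the optimized scheme."""
    basic = R * (S - R + 1)  # assumed: R loads per output feature map
    optimized = S            # each input feature map is loaded once
    return basic, optimized


# Thirty input feature maps and a kernel length of three.
basic, optimized = feature_map_loads(S=30, R=3)
ratio = basic / optimized
```

Under this assumption the ratio is R(S − R + 1)/S, which approaches R as S grows; for S = 30 and R = 3 it is 2.8.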
The kernel caches permanently store the 3D convolution kernels, so that they are accessible to the 3D convolution architecture. Each 2D convolver is directly connected to an input channel. Figure 8 shows the details of the 2D convolvers used in the 3D convolution architecture of Fig. 7, where a full buffering scheme is adopted [13]. In this scheme, (P − 1) FIFOs with a length of (N − Q) are employed to temporarily hold the data before they reach the 2D convolvers. We use P sets of right shifters, each of which consists of Q registers, to assemble the P × Q convolution windows. The window operation is the sum of the inner pixels multiplied by the corresponding weights of the kernel. Once a new pixel is loaded, the convolution window automatically moves to the next position. To achieve parallel operation, the loops over the kernel height and width are fully unrolled and pipelined, which may significantly enhance the throughput.
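The full buffering scheme can be modelled in software as a chain of P register rows separated by (P − 1) FIFOs of length (N − Q). The sketch below is a behavioural model only, written under the assumption that pixels arrive one per step in raster-scan order; the function name `stream_conv2d` and the use of `collections.deque` are choices made for this illustration and say nothing about the actual register-transfer implementation in Fig. 8.

```python
from collections import deque


def stream_conv2d(pixels, N, kernel):
    """Streaming valid 2D convolution over a raster-scan pixel stream.

    pixels: iterable of pixel values for one feature map of width N
    kernel: P x Q kernel slice, indexed kernel[p][q]
    returns the output pixels as a flat list in raster order
    """
    P, Q = len(kernel), len(kernel[0])
    regs = [deque() for _ in range(P)]       # regs[0] holds the newest row
    fifos = [deque() for _ in range(P - 1)]  # each holds N - Q pixels
    out = []
    for idx, pix in enumerate(pixels):
        carry = pix
        for p in range(P):
            if carry is None:
                break
            # Shift the pixel into the register row; the oldest value
            # falls out once the row holds more than Q entries.
            regs[p].appendleft(carry)
            carry = regs[p].pop() if len(regs[p]) > Q else None
            if p < P - 1 and carry is not None:
                # Delay the pixel by N - Q steps before the next row.
                fifos[p].appendleft(carry)
                carry = fifos[p].pop() if len(fifos[p]) > N - Q else None
        row, col = divmod(idx, N)
        if row >= P - 1 and col >= Q - 1:    # window lies inside the map
            acc = 0
            for p in range(P):
                for q in range(Q):
                    acc += kernel[p][q] * regs[P - 1 - p][Q - 1 - q]
            out.append(acc)
    return out


# 3x4 map holding 0..11 in raster order, 2x2 kernel picking the diagonal.
stream_out = stream_conv2d(range(12), N=4, kernel=[[1, 0], [0, 1]])
```

Each register row p holds pixels that are p·N to p·N + Q − 1 steps old, so the P rows together always expose one P × Q window, and one output is produced per incoming pixel whenever the window is valid.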
B. OPTIMIZED 3D CONVOLUTION ACCELERATOR IN FPGA
Moreover, in Fig. 7, the IDDLs are used to avoid repeated loading of the input feature maps, where the length of each IDDL is (M − P + 1) × (N − Q + 1). The IDDLs are shift registers and may be implemented with flip-flops or block RAMs, depending on the specific FPGA device. The current outputs of the 2D convolvers are stored in the IDDLs until the next feature map is loaded. To reduce the power consumption, the 2D convolvers are kept off except for those whose corresponding IDDLs are full. The data interconnect bus is the data transfer path between the various parts of the structure. The adders sum the relevant outputs and are switched off unless they receive valid inputs. This design provides (S − R + 1) output feature maps while simultaneously reducing the power consumption. The adders have different priorities, determined by the data output controller. In this architecture, the adder connected to IDDL 1 has the highest priority. The arrangement of the priority levels allows the PS to determine the order of the output feature maps.
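The role of the IDDLs can be illustrated with a frame-level software model in which every input feature map is read once, convolved with all R kernel slices, and the partial results are held until the remaining contributions arrive. This is a simplified sketch: it processes one feature map at a time rather than R parallel channels, it ignores the power gating and output priorities described above, and the names `conv2d_valid_flat` and `conv3d_streaming` are hypothetical. Each held buffer has (M − P + 1) × (N − Q + 1) entries, matching the IDDL length given in the text.

```python
def conv2d_valid_flat(img, k):
    """Valid 2D convolution, returned as a flat raster-order list."""
    M, N = len(img), len(img[0])
    P, Q = len(k), len(k[0])
    return [sum(k[p][q] * img[m + p][n + q]
                for p in range(P) for q in range(Q))
            for m in range(M - P + 1) for n in range(N - Q + 1)]


def conv3d_streaming(maps, kernel):
    """3D convolution that reads each input feature map exactly once."""
    S, R = len(maps), len(kernel)
    pending = {}   # output index -> partial sums (stand-in for an IDDL)
    outputs = []
    loads = 0
    for j, frame in enumerate(maps):
        loads += 1                       # single load of feature map j
        for r in range(R):
            s = j - r                    # output map fed by slice r
            if s < 0 or s > S - R:
                continue
            part = conv2d_valid_flat(frame, kernel[r])
            if s in pending:
                pending[s] = [a + b for a, b in zip(pending[s], part)]
            else:
                pending[s] = part
        done = j - R + 1                 # this output is now complete
        if done >= 0:
            outputs.append(pending.pop(done))
    return outputs, loads


# Toy input: three 2x2 maps filled with 1, 2, 3; kernel slices 1x1.
iddl_maps = [[[s + 1] * 2 for _ in range(2)] for s in range(3)]
iddl_out, iddl_loads = conv3d_streaming(iddl_maps, [[[1]], [[10]]])
```

The model yields the same (S − R + 1) output maps as a direct 3D convolution while the load counter stays at S, which is the property the delay lines are meant to provide.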
IV. EXPERIMENTAL DEMONSTRATION
This section experimentally demonstrates the performance of the implemented accelerator. First, we show the resource usage of the accelerator under different pixel precisions. Then, we present the performance of the designed structure and compare it with other technologies. Figure 9 shows a photograph of the realized accelerator using the Xilinx ZC706 development board. This board comprises a Xilinx ZYNQ device, which itself consists of a Kintex-7 FPGA and a dual-core ARM Cortex-A9 processor, and 1 GB of DDR3 memory with a bandwidth of up to 4.2 GB/s. The results are obtained using the Xilinx Vivado 2016.1 development software. The design specifications are as follows. The input signal is a 30-f/s low-resolution video, where the resolution of each frame is 256 × 256. Various pixel precisions are chosen, i.e. 8, 10, 12, 16 and 32-bit signed and 32-bit float. The kernel lengths are chosen as 3 and 4, where the kernel height × width is 3 × 3, 5 × 5, 7 × 7, 9 × 9 or 11 × 11. Table 3 lists the resource utilization for different pixel precisions and kernel sizes. Flip-flops and look-up tables represent the most basic resources in an FPGA, while the BRAM-18K and DSP48E are the storage and computation resources, respectively. The bigger the kernel and the higher the pixel precision, the more resources are utilized. For pixel precisions of less than 16 bits, one pixel multiplication is performed with one DSP48E. For the 16-bit signed pixel precision, a pixel multiplication is performed with two DSP48Es, while for the 32-bit signed and 32-bit float pixel precisions, a pixel multiplication is performed with four and eight DSP48Es, respectively. Due to the resource constraints, for the 32-bit signed and 32-bit float pixel precisions we only present the configurations that are implementable.
TABLE 3. FPGA resource usage for the designed 3D convolution processor prototype.
FIGURE 9. Photograph of the implemented prototype using Xilinx ZC706.
Let us now compare the proposed FPGA-based 3D accelerator with the CPU and GPU. The CPU platform is an Intel Dual Core i7-6700K CPU at 4 GHz with 32 GB of RAM. The GPU platform is an NVIDIA GTX1080, possessing 2560 CUDA cores with an 8-GB GDDR5 256-bit memory. The operating system for the CPU and GPU is Ubuntu 16.04 with the Keras deep learning framework. With the Keras framework, we may easily select either the CPU or the GPU to run the 3D convolution. The precision of the pixels and kernels in software is 32-bit float. A 100-MHz system clock is used for the FPGA-based 3D accelerator.
The latencies for the various pixel precisions differ by several clock cycles, but in the pipelined design these differences are hidden rather than accumulated when calculating the 3D convolution. Therefore, the difference in computation time for the same kernel size with different precisions can be less than one percent. Since the calculation of the same kernel size under different precisions takes practically the same computation time, Table 4 only lists the experimental results for the various kernel sizes. As shown in this table, the FPGA-based 3D accelerator processes thirty gray feature maps of size 256 × 256, convolved with an 11 × 11 × 4 kernel, in only 5.9 ms and reaches a speed of 268.07 GOPS. On average, the FPGA-based 3D accelerator is 14 times faster than the CPU and slightly faster than the GPU. Figure 10 provides the details of the acceleration specifications. The GOPS ratios of the proposed accelerator over the CPU and GPU increase with the kernel size. The maximum GOPS of the proposed convolution accelerator is 25.19× and 2.65× that of the CPU and GPU, respectively, with the 11 × 11 × 4 kernel. In terms of GOPS per watt, the FPGA significantly outperforms the CPU and GPU.
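The reported throughput can be cross-checked from the figures quoted above. The sketch below assumes that one output pixel counts as P·Q·R multiplications plus (P·Q·R − 1) additions, as in Section II, and uses the rounded computation time of 5.9 ms; the counting convention actually used for Table 4 is not stated, so the result is expected to match 268.07 GOPS only approximately. The function name `gops` is hypothetical.

```python
def gops(M, N, S, P, Q, R, seconds):
    """Giga operations per second of a valid 3D convolution."""
    outputs = (M - P + 1) * (N - Q + 1) * (S - R + 1)
    ops_per_output = 2 * P * Q * R - 1   # multiplications + additions
    return outputs * ops_per_output / seconds / 1e9


# Thirty 256x256 feature maps, 11x11x4 kernel, 5.9 ms (values from the text).
estimate = gops(M=256, N=256, S=30, P=11, Q=11, R=4, seconds=5.9e-3)
```

Under these assumptions the estimate comes out a fraction of a GOPS below the reported value, a gap that is consistent with the computation time having been rounded to 5.9 ms.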
Our work achieves a speed of one pixel per clock and 182 f/s. Since the computational complexity of the 3D convolution is R times that of the 2D convolution, where R is the length of the kernel, the calculation performance of our work is at most four times that of the 2D convolution accelerators in [13]-[16] and 3.75 times that of the work in [17] (4 × 182/194 ≈ 3.75).
V. CONCLUSION
We presented an efficient 3D convolution operator based on an FPGA accelerator. The proposed structure significantly improves the convolution performance. The accelerator is characterized by an array of parallel 2D convolvers interconnected with IDDLs and adders, together with other special-purpose hardware. The key attribute of the proposed processor is the use of IDDLs to avoid repeated loading of the feature maps being processed. The FPGA-based accelerator reaches a speed of 268.07 GOPS and is slightly faster than the fastest commercially available GPU. In the future, we plan to apply the proposed FPGA-based 3D accelerator to the acceleration of 3D CNN models. The proposed accelerator may provide a higher throughput in a larger model.
