Abstract: Two-dimensional (2-D) convolution is a widely used operation in image processing and computer vision, characterized by intensive computation and frequent memory accesses. Previous efforts to improve the performance of field-programmable gate array (FPGA) convolvers focused on the design of buffering schemes and on minimizing the use of multipliers. A recently proposed recurrently decomposable (RD) filter design method can reduce the computational complexity of 2-D convolutions by splitting the convolution between an image and a large mask into a sequence of convolutions using several smaller masks. This brief explores how to efficiently implement RD-based 2-D convolvers on FPGAs. Three FPGA architectures are proposed based on RD filters, each with a different buffering scheme. The conclusion is that RD-based architectures achieve higher area efficiency than previously reported state-of-the-art methods, especially for larger convolution masks. An area efficiency metric is also suggested, which allows the most appropriate architecture to be selected.
I. INTRODUCTION
TWO-DIMENSIONAL (2-D) convolution is widely used in image analysis and computer vision. It is also increasingly found in the latest generation of deep-learning systems for image categorization [1]. Convolution can be computationally intensive: for an R × S convolution mask, R × S multiplications, R × S − 1 additions, and R × S accesses to the input data are required for processing each pixel. In modern computer vision and deep convolutional networks, dozens to hundreds of convolution masks may be required.
A field-programmable gate array (FPGA) represents a good choice of device for performing 2-D convolution because of the ability to fully exploit the inherent parallelism involved in this spatial operation. Designs for efficient FPGA implementations (achieving, for example, a throughput of 1 pixel/clock) mainly focus on two aspects: the design of the buffering scheme and improving the convolution kernel module. A good buffering scheme [2], [3] can lead to reduced on-chip resources by limiting the number of input buffers based on an acceptable external memory bus bandwidth. Specifically, a full buffering (FB) scheme [2] would have an optimal external memory bus bandwidth of 1 pixel/clock, while a single-window partial buffering (SWPB) scheme [2] requires the least amount of on-chip resources; a multiwindow partial buffering (MWPB) scheme [3] is a tradeoff between these two extremes.
A diverse set of ideas has been explored to reduce the complexity of the convolution kernel module because this can account for a significant part of the FPGA resources. Bosi et al. [2] devised a 2-D convolver with multiplexed 1-D convolution modules or a pixel interlacing strategy. These schemes sacrifice processing speed in order to reduce hardware consumption. A second category of methods replaces the multipliers with alternatives, such as shift-and-accumulation (SA) operations [4], look-up tables (LUTs) [5], and log2 and inverse-log2 approximations [6]. However, the total number of multiplications or substitute algorithms (MSAs) remains untouched in these multiplier-less implementations. Improved area efficiency was achieved by combining log2 and inverse-log2 approximations with folding operations in [7]. These folding operations rely on symmetry in the convolution masks, but mask symmetries vary greatly, particularly when the goal of convolution is to produce rich feature spaces [1].
Recurrently decomposable (RD) filters [8] provide a feasible way of reducing MSAs for arbitrary 2-D filters. The complexity is reduced at the software level by separating the convolution mask into a series of convolutions using smaller masks. In this brief, the FPGA implementations of RD-based 2-D convolvers are studied. Three architectures adopting FB, SWPB, and MWPB buffering schemes are presented, and for each, we study bandwidth requirement and area utilization. The performance of the resulting FPGA implementations is compared with other state-of-the-art approaches. A metric that integrates both required bandwidth and resource utilization is proposed for selecting the most appropriate architecture to meet design considerations.
In Section II, we briefly describe the RD approach for 2-D filter implementation, the foundation of all convolution design improvements in this brief. The three proposed RD-based convolvers are provided in Section III, with the comparisons of FPGA implementations given in Section IV. Finally, Section V concludes this brief.
II. RD 2-D FILTERS
Conventionally, 2-D filters are defined either as separable or nonseparable according to whether they can be decomposed into the tensor product of two 1-D filters. In a previous work [8], we showed that the computation efficiency of 2-D convolution can be improved if a 2-D filter can be separated into the convolution of two smaller 2-D filters. We termed filters that satisfy this property RD filters.
1549-7747 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
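Consider the case where an R × S mask ω is decomposed into the convolution of a P × Q small mask ϕ and a J × K small mask ψ
$$\omega = \phi * \psi \tag{1}$$
where * denotes the convolution operation; we have
$$R = P + J - 1, \qquad S = Q + K - 1. \tag{2}$$
Thus, for each pixel, the total MSA saving is
$$R \times S - (P \times Q + J \times K) \tag{3}$$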
which is of quadratic order in mask size. For example, if a 17 × 17 mask can be separated into the convolution of two 9 × 9 masks, or of a 16 × 2 mask and a 2 × 16 mask, the number of MSAs is reduced by 127 and 225, respectively. Compared with the original 289 MSAs per pixel, this represents a significant saving in computation (43.94% and 77.85%, respectively). These two cases correspond to the worst case and the best case in terms of improvement in computational complexity among all possible decompositions. In real applications, there will be a tradeoff between accuracy and efficiency. The decomposition is conducted by solving the unconstrained optimization problem
where (·)_{i,j} denotes the kernel weight at position (i, j) and ε is an acceptable upper bound on error tolerance based on the application. Practically speaking, the approximation error between the original kernel ω and the convolution of the smaller masks ϕ and ψ (i.e., ϕ * ψ) tends to be small, so the improved efficiency can often be obtained with negligible effect on numeric performance. Readers are referred to [8] for more discussion on error analysis.
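The MSA counts quoted for the 17 × 17 example can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `msa_saving` is hypothetical, and it assumes the full-convolution size relation R = P + J − 1, S = Q + K − 1, which is consistent with both decompositions quoted above.

```python
def msa_saving(p, q, j, k):
    """Return (direct MSAs, cascaded MSAs, MSAs saved) per output pixel."""
    # Assumed size relation of the composed mask: R = P + J - 1, S = Q + K - 1.
    r, s = p + j - 1, q + k - 1
    direct = r * s             # one R x S convolution
    cascaded = p * q + j * k   # a P x Q convolution followed by a J x K one
    return direct, cascaded, direct - cascaded


for sizes in [(9, 9, 9, 9), (16, 2, 2, 16)]:
    direct, cascaded, saved = msa_saving(*sizes)
    print(sizes, direct, cascaded, saved, f"{100 * saved / direct:.2f}%")
# (9, 9, 9, 9) 289 162 127 43.94%
# (16, 2, 2, 16) 289 64 225 77.85%
```

The elongated pair of masks saves more because P × Q + J × K is smallest when each small mask is thin in one dimension.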
For a given 2-D convolution mask, we can select freely from all possible decompositions that meet the error tolerance requirement. Once the decomposition scheme (ϕ_{P×Q} and ψ_{J×K}) is determined, we seek an efficient FPGA implementation. Although ϕ or ψ can sometimes be further decomposed, we limit our discussion in this brief to one-layer decompositions, which is sufficient for illustration.
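The property that makes a one-layer decomposition usable is the associativity of convolution: filtering with ϕ and then with ψ gives the same result as filtering once with ω = ϕ * ψ. The sketch below illustrates this in pure Python for an exactly decomposable integer mask (so ε = 0). The function names and the small test masks are hypothetical choices for illustration, not taken from [8].

```python
def conv2d_full(a, b):
    """Full 2-D convolution of two lists-of-lists."""
    ra, ca, rb, cb = len(a), len(a[0]), len(b), len(b[0])
    out = [[0] * (ca + cb - 1) for _ in range(ra + rb - 1)]
    for i in range(ra):
        for j in range(ca):
            for m in range(rb):
                for n in range(cb):
                    out[i + m][j + n] += a[i][j] * b[m][n]
    return out


def conv2d_valid(image, mask):
    """Convolution without boundary extension: keep fully covered pixels only."""
    full = conv2d_full(image, mask)
    r, s = len(mask), len(mask[0])
    rows, cols = len(image), len(image[0])
    return [row[s - 1:cols] for row in full[r - 1:rows]]


phi = [[1, 2], [3, 4]]             # P x Q = 2 x 2
psi = [[1, 0, -1], [2, 0, -2]]     # J x K = 2 x 3
omega = conv2d_full(phi, psi)      # R x S = 3 x 4

image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]
cascade = conv2d_valid(conv2d_valid(image, phi), psi)
direct = conv2d_valid(image, omega)
print(cascade == direct)  # True
```

Here the direct path needs 12 MSAs per pixel and the cascade needs 4 + 6 = 10; the gap widens quickly as the masks grow.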
III. RD-BASED ARCHITECTURE DESIGN
In this section, we describe three FPGA architectures of RD-based 2-D convolvers, each supported by a different buffering scheme. We provide comparisons of area utilization and bandwidth requirement with conventional schemes.
The RD-based architecture consists of a cascade of convolvers that correspond to the smaller sized masks, with buffering between them. In the proposed RD-based architectures (see Figs. 1-3), convolvers I and II implement ϕ_{P×Q} and ψ_{J×K}, respectively, where the size constraint in (2) is satisfied. All discussions are based on the scenario that an input M × N image is convolved with an R × S convolution mask without boundary extension. The output image will be (M − 2 × ⌊R/2⌋) × (N − 2 × ⌊S/2⌋) in size, where ⌊X⌋ denotes the largest integer less than or equal to X.
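As a quick numerical check of the output-size expression, the short sketch below evaluates it with integer floor division. The function name `output_size` is a hypothetical helper, and the example uses the 1024 × 1024 image and 17 × 17 mask considered later in the experiments.

```python
def output_size(m, n, r, s):
    """Output dimensions for an M x N image and an R x S mask, no boundary extension."""
    # Python's // on non-negative integers is the floor operation used in the text.
    return m - 2 * (r // 2), n - 2 * (s // 2)


print(output_size(1024, 1024, 17, 17))  # (1008, 1008)
```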
A. RD-Based FB Architecture
In the proposed RD-based FB architecture, pixels are fetched from external memories and buffered on-chip in convolver I as depicted in Fig. 1. A first-in first-out (FIFO) buffer with depth T is introduced to balance the clock frequency and data width between the external memory bus and the convolver. Convolver II needs no FIFO: the output of convolver I is buffered by a combination of J − 1 line buffers, each of length N − 2 × ⌊Q/2⌋ − K, and J sets of register arrays in convolver II. The J × K convolution in convolver II begins when these buffers are filled up.
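The FB idea of buffering whole image lines on-chip, so that each pixel is fetched from external memory only once, can be mimicked in software. The following is a minimal behavioral sketch and not the Verilog design described in this brief: the function name `fb_stream_convolve` is hypothetical, a single deque stands in for the line buffers plus register arrays, and the FIFO is not modeled.

```python
from collections import deque


def fb_stream_convolve(pixels, width, mask):
    """Consume pixels in raster order, one per step, and emit valid outputs."""
    rows, cols = len(mask), len(mask[0])
    # (rows - 1) full image lines plus `cols` pixels cover one whole window.
    history = deque(maxlen=(rows - 1) * width + cols)
    out = []
    for index, pixel in enumerate(pixels):
        history.append(pixel)
        row, col = divmod(index, width)
        # A window is valid once it lies entirely inside the image.
        if row >= rows - 1 and col >= cols - 1:
            acc = 0
            for r in range(rows):
                for c in range(cols):
                    # history[0] is the top-left pixel of the current window;
                    # the mask is flipped, as required for a true convolution.
                    acc += mask[rows - 1 - r][cols - 1 - c] * history[r * width + c]
            out.append(acc)
    return out


# 3 x 4 image in raster order, 2 x 2 all-ones mask: each output is a window sum.
print(fb_stream_convolve(list(range(12)), 4, [[1, 1], [1, 1]]))
# [10, 14, 18, 26, 30, 34]
```

Cascading two such calls, with the second one using the reduced line width of the first stage's output, gives a software analogue of convolvers I and II.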
The proposed RD-based architecture inherits the advantages of the FB scheme in that it only requires an external memory bus bandwidth of 1 pixel/clock. Compared with the conventional FB scheme, which implements the R × S mask as a whole, there is a reduction of (P − 1) × (K − 1) + (J − 1) × (Q − 1) − 1 MSAs: a significant saving in hardware resources. There is also a slight saving of 2 × ⌊Q/2⌋ × (J − 1) − 1 in the number of shift registers, which is negligible compared to the total resources used.
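The factored form of the MSA reduction follows directly from the mask sizes. As a short derivation, assuming the full-convolution size relation R = P + J − 1 and S = Q + K − 1:

$$
\begin{aligned}
RS - (PQ + JK) &= (P + J - 1)(Q + K - 1) - PQ - JK \\
&= PK + JQ - P - J - Q - K + 1 \\
&= (P - 1)(K - 1) + (J - 1)(Q - 1) - 1 .
\end{aligned}
$$

For two 9 × 9 masks this gives 8 × 8 + 8 × 8 − 1 = 127, and for a 16 × 2 mask with a 2 × 16 mask it gives 15 × 15 + 1 × 1 − 1 = 225, matching the savings quoted in Section II.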
B. RD-Based SWPB Architecture
The SWPB scheme [2] was proposed to reduce the number of buffers under a certain external memory bus bandwidth requirement. Unlike the FB scheme, the P × Q shift registers in convolver I directly receive data from FIFOs, and then the convolution window moves to the next position with P pixels shifted in. An external memory of at least (N − 2 × ⌊Q/2⌋) × J in size is adopted to store the interim convolution results of convolver I. These results are fed to a J × K SWPB convolver (convolver II) for further processing. On-chip resource reduction is obtained at the price of increased external memory bus bandwidth.
As compared with the conventional SWPB scheme [2], (P − 1) × (K − 1) + (J − 1) × (Q − 1) − 1 MSAs and shift registers are saved, but one more FIFO is needed. The difference in pixel-buffering resources is negligible compared with the resources consumed by the whole convolution module. Considering also the external temporary memory, the bandwidth demanded by the RD-based architecture is raised by 2 pixels/clock.
C. RD-Based MWPB Architecture
To balance the on-chip resource utilization and external memory bus bandwidth, the MWPB scheme [3] was proposed to reuse data that have already been stored in internal buffers. Fig. 3 illustrates the proposed RD-based MWPB architecture as a cascade of two MWPB convolvers. An external memory of at least (M − 2 × ⌊P/2⌋) × K in size is utilized to store interim results, and as the processing is in column-major scan format, column-to-row scan format adaptation could be performed to ensure data consistency whenever necessary [3].
Compared with the conventional MWPB scheme, an RD-based architecture reduces the number of required registers
One additional FIFO is required, and (P − 1) × (K − 1) + (J − 1) × (Q − 1) − 1 MSAs are saved. The external memory bus bandwidth required by the RD-based MWPB architecture is (R + S + 1)/S pixels/clock. Table I summarizes the main features of the different architectures when the throughput of each is fixed at 1 pixel/clock. Area utilization is measured by the total number of pixels the convolver needs to buffer and the number of MSAs that the 2-D convolution kernel module requires, where T represents the depth of the FIFO. External memory bus bandwidth is given in pixels/clock.
As shown in Table I, the RD-based architectures consume almost the same resources in buffered pixels (depending on the depth T of the FIFOs) but save a large number of MSAs for all three buffering schemes. In terms of bandwidth requirement, the RD-based FB architecture maintains the 1-pixel/clock external memory bus bandwidth. The slight increase of 2/S hardly influences the bandwidth of the MWPB architecture (around 2 pixels/clock, if a square-shaped convolution mask is assumed), especially for larger mask sizes. For the RD-based SWPB architecture, the required bandwidth is raised by 2 pixels/clock. A tradeoff must be made between the MSAs saved and the increased bandwidth for small mask sizes (see more discussion on the architecture selection in Section IV).
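To make the MWPB bandwidth comparison concrete, the sketch below evaluates both figures with exact fractions. The RD-based value (R + S + 1)/S is taken from the text. The conventional value (R + S − 1)/S is not quoted directly here and is an inference, obtained by subtracting the stated 2/S increase. The function name `mwpb_bandwidth` is hypothetical.

```python
from fractions import Fraction


def mwpb_bandwidth(r, s, rd_based):
    """External bandwidth in pixels/clock for MWPB at 1 pixel/clock throughput."""
    # RD-based: (R + S + 1)/S from the text.
    # Conventional: inferred as (R + S - 1)/S, i.e. 2/S less than the RD-based value.
    numerator = r + s + 1 if rd_based else r + s - 1
    return Fraction(numerator, s)


for size in (3, 17, 49):
    conventional = mwpb_bandwidth(size, size, rd_based=False)
    rd = mwpb_bandwidth(size, size, rd_based=True)
    print(size, conventional, rd, f"{float(rd - conventional):.3f}")
```

For square masks both values approach 2 pixels/clock as the mask grows, and the gap of 2/S shrinks accordingly.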
IV. PERFORMANCE ANALYSIS

A. FPGA Implementation
In this experiment, we consider an input image of size 1024 × 1024 which is to be convolved with a 17 × 17 mask. The FPGA implementations were realized on a low-cost XILINX Spartan-3 XC3S4000 FPGA using the Integrated Software Environment (ISE) 14.7. The data width of each pixel in the input image was 8 bits, while the depth of each FIFO was set to 8. MSAs were realized with multipliers. All of the FIFOs, line buffers, multipliers, and adders were instantiated from the IP cores provided by ISE. Specifically, the FIFOs were implemented from shift registers, while line buffers were implemented using RAM-based shift registers. All of these designs were written in Verilog and synthesized using the XILINX Synthesis Tool. Table II summarizes the hardware resource utilization in terms of slices for the input buffer module, the convolution kernel module, and the output buffer module. The resource reductions in the kernel module and in the complete RD architecture are also given as percentages relative to the conventional designs. A throughput of 1 pixel/clock is assumed for all architectures.
A significant reduction in hardware utilization can be seen in Table II for all RD architectures. For the worst and best cases of the three RD-based architectures, the real implementations show reductions of around 42% and 78%, respectively, in the hardware resources required by the convolution kernel module. These figures are very close to the theoretical values (43.94% and 77.85%); the small differences are due to the control logic used for pipeline control in the different buffering schemes. The savings in resources for the complete design can be up to 71.53% for the SWPB scheme; the benefit is smaller for the FB scheme (28.10% for the worst case) because the input buffers account for a large part of the complete design. Note that the benefits apply to all MSA types (e.g., SA [4], LUT [5], and log2 and inverse-log2 approximations [6]), as their computational complexities all reduce proportionally with the number of MSAs.
B. Architecture Selection
Bandwidth requirement is a key factor in architecture selection for buffering schemes. Generally speaking, the FB and MWPB architectures are widely adopted, while the high bandwidth requirement of SWPB architecture may not be satisfied in practical applications for a low-cost FPGA implementation.
To complete the architecture selection, we have to determine whether a recurrent decomposition should be performed for a specific design point. Prior performance metrics [3], [9] do not incorporate the hardware consumption of the convolution kernel module; we address this key issue by proposing a new performance metric.
Supposing that the maximum computing efficiency is achieved (throughput is fixed at 1 pixel/clock), we may define the optimum architecture as the one that minimizes the area efficiency metric ξ
where BW_in denotes the required external memory bus bandwidth and N_MSA denotes the total number of MSAs involved in the convolution kernel module. We consider N_MSA a good indicator of the resource consumption of a candidate architecture, without going into further details about MSA type and different component libraries. The lower the metric value ξ, the more efficient the architecture is for a particular design point.
Table III provides a comparison of the required bandwidth and number of MSAs for six architectures, where ⌈X⌉ denotes the smallest integer greater than or equal to X. The results of the best and worst cases can be used as the upper and lower bounds for area efficiency improvements in hardware resources. Fig. 4 visualizes the proposed metric ξ on a log scale with respect to varying convolution mask sizes for different buffering schemes, where three mask sizes are selected as representatives for each of three size ranges: small (3 × 3, 5 × 5, and 7 × 7), medium (13 × 13, 17 × 17, and 21 × 21), and large (33 × 33, 41 × 41, and 49 × 49). As shown in Fig. 4, RD-based FB architectures are more efficient for all mask sizes, whereas for the partial buffering architectures, the mask size should be greater than 5 × 5 for the RD-based version to be more efficient. This is in agreement with the qualitative analysis in Section III. Furthermore, the improvements in area efficiency brought by recurrent decompositions become more pronounced as the mask size increases, for all buffering schemes.
V. CONCLUSION
In this brief, we have presented three RD-based FPGA architectures adopting the FB, SWPB, and MWPB schemes, respectively. By recurrent decomposition, resources consumed by the convolution kernel module are greatly reduced. FPGA implementations demonstrate better area efficiency, especially for large convolution kernels. An area efficiency metric is suggested to guide architecture selection.
We have restricted this brief to the buffering schemes of the basic architectures of the RD-based convolvers. For substitute algorithms that improve the efficiency of convolution kernel modules [4]-[7], the computational complexity and hardware consumption shrink in proportion to the number of MSAs when a recurrent decomposition is applied. Thus, these schemes can also benefit from RD architectures.
