Abstract-In recent years deep learning algorithms have shown extremely high performance on machine learning tasks such as image classification and speech recognition.
I. INTRODUCTION
Deep learning algorithms have shown extremely high performance on machine learning tasks. In particular, convolutional neural networks (CNNs) have become the state of the art for applications such as computer vision and audio recognition [1] [2] [3]. To address the increasing demand for applications that run neural network algorithms in real time on embedded devices, various high-performance hardware platforms for discriminative CNN implementations have been proposed, including distributed GPUs and customized accelerators such as FPGAs and ASICs [4] [5]. FPGA-based accelerators are attractive because they have lower latency and consume less power than GPUs while being more flexible and configurable than ASICs [6] [7].
However, current FPGA accelerators focus on enhancing the performance of CNNs, not deconvolutional neural networks (DCNNs). Unlike discriminative CNNs, which effectively "downsample" the input to produce a classification [1], DCNNs are generative models capable of generating data by "upsampling" the input using deconvolution layers [8]. There are many applications of DCNNs, including multi-modal data modeling [9], super resolution [10], and image-to-image translation [11] [12] (see Fig. 1). Such applications motivate us to design an FPGA-based accelerator that can execute deconvolution operations with high throughput and low cost.
There are several issues that must be addressed to design an FPGA-based deconvolution accelerator. First, a direct translation of CPU-optimized deconvolution algorithms to an FPGA will generally lead to inefficient implementations. A suitable adaptation of the deconvolution operation to a hardware substrate such as FPGA is therefore necessary in order to achieve high performance with low implementation complexity. In addition, although recent research shows that discriminative CNNs are robust to low bitwidth quantization [13] [14] , it is important to be able to systematically study the effects of such bitwidth reductions on the quality of inference from a generative model such as DCNN implemented with finite precision on FPGA. Thus it is necessary to use metrics which quantify the effects of such approximations in DCNNs in order to achieve an efficient design optimized for performance and power.
To address the issues described above, we make the following contributions in this paper. 1) We create a deconvolution accelerator with reverse looping and stride hole skipping to efficiently implement deconvolution on an FPGA, where our proposed solution, in a nontrivial way, reuses the same computational architecture proposed for implementing a convolution accelerator in [6] . 2) We propose a three-step procedure to design the deconvolution accelerator as follows. A) At the highest design level, we train DCNNs using the generative adversarial network method (GAN) [15] and use statistical tests to quantitatively analyze the generative quality under different bitwidth precisions to select the most cost-efficient bitwidth. B) We use the roofline model proposed in [6] to explore the design space in order to find the set of high-level constraints that achieves the best tradeoff between memory bandwidth and accelerator throughput. C) We use loop unrolling and pipelining, memory partitioning, and register insertion to further optimize performance. 3) We validate our procedure via two implementations on a Xilinx Zynq-7000 FPGA.
The rest of this paper is organized as follows: Section II provides background on the DCNN and the deconvolution layers. Section III presents our methodology for efficiently implementing an FPGA-based deconvolution accelerator. Section IV explains our three-step design methodology. Section V shows our experimental results. Section VI concludes the paper.
II. DECONVOLUTIONAL NEURAL NETWORK
A deconvolutional neural network (DCNN) converts latent space representations to high-dimensional data similar to the training set by applying successive deconvolution operations in multiple layers [16]. The latent space contains low-dimensional latent variables that provide a succinct ("conceptual") representation of the possible outputs (e.g. an image). Thus a latent variable may correspond to "chair", with the associated output being the image of a chair "generated" by the DCNN (see Fig. 1). Fig. 2 shows a 5-layer DCNN developed in [17] that contains 4 deconvolution layers. The first layer is fully connected and transforms an input of size 1x100 to an output of size 1024x4x4; layers 2 to 5 are deconvolution layers that project low-dimensional feature maps into corresponding high-dimensional ones.
Fig. 2. A DCNN that generates realistic 64x64 indoor scenes using four deconvolution layers, trained on the Large-scale Scene Understanding (LSUN) dataset [17] [18]. (Image is taken and adapted from reference [17].)
Fig. 3 shows how a typical deconvolution layer works, where S and P denote the chosen values of stride and padding, respectively, for a given layer. The pseudo code of a deconvolution layer as implemented on a CPU is shown in Algorithm 1, which uses the loop variables defined in Fig. 4.
By convention we use capital letters, e.g. $O_H$, to denote specific parameters of the DCNN and small letters, e.g. $o_h$, to denote the corresponding loop variables.
Fig. 3. Visualization of a single deconvolution layer. The four steps required to implement the deconvolution layer are: (1) multiply a single input pixel $(i_h, i_w)$ by a $K \times K$ kernel; (2) add the result of step 1 to a local area in the output feature map that starts at $(i_h \times S, i_w \times S)$; (3) repeat steps 1 and 2 for all input pixels; (4) remove a border of size $P$ from the output feature maps (zero padding of size $P$).
The relation of the input size $I_H \times I_W$ to the output size $O_H \times O_W$ after applying stride and padding is given by the following equations [19]:
$$O_H = S \times (I_H - 1) + K - 2P, \qquad O_W = S \times (I_W - 1) + K - 2P \tag{1}$$
III. DECONVOLUTION HARDWARE DESIGN
An FPGA accelerator usually consists of processing elements (PEs), registers, and local memory elements referred to as block RAMs (BRAMs). Processing elements operate on data provided by the local memory, which communicates with external double data rate (DDR) memory using direct memory access (DMA). Fig. 5 shows a traditional implementation of deconvolution, where 
Here the zero padding P = 0 because blocks are inside input feature maps. However, Eq. 3 shows that deconvolution results of input blocks overlap with each other:
Deconvolution arithmetic requires overlapping regions between output blocks to be summed together [19], which can be realized in processor-based implementations. However, handling such operations in FPGAs requires either the design of additional hardware blocks, which creates overhead, or communication with a host processor, which can increase system latencies and thereby preclude real-time applications.
Algorithm 1 Deconvolution in CPU
procedure DECONVOLUTION
  for i_c = 0 to I_C − 1 do
    for i_h = 0 to I_H − 1 do
      for i_w = 0 to I_W − 1 do
        for o_c = 0 to O_C − 1 do
          for k_w = 0 to K − 1 do
            ...
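The input-driven procedure described by the four steps of Fig. 3 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function names, the nested-list data layout, and the weight ordering `[I_C][O_C][K][K]` are assumptions made for the example, and the index mapping `oh = S*ih + kh - P` is derived from the step descriptions.

```python
def deconv_output_size(in_size, kernel, stride, padding):
    """Output height or width of a deconvolution layer: S*(I-1) + K - 2P."""
    return stride * (in_size - 1) + kernel - 2 * padding


def deconv_direct(inp, weight, stride, padding):
    """Input-driven deconvolution (hypothetical helper).

    inp:    nested list indexed [I_C][I_H][I_W]
    weight: nested list indexed [I_C][O_C][K][K] (layout is an assumption)
    Returns a nested list indexed [O_C][O_H][O_W].
    """
    n_ic, n_ih, n_iw = len(inp), len(inp[0]), len(inp[0][0])
    n_oc, k = len(weight[0]), len(weight[0][0])
    n_oh = deconv_output_size(n_ih, k, stride, padding)
    n_ow = deconv_output_size(n_iw, k, stride, padding)
    out = [[[0.0] * n_ow for _ in range(n_oh)] for _ in range(n_oc)]
    for ic in range(n_ic):
        for ih in range(n_ih):
            for iw in range(n_iw):
                for oc in range(n_oc):
                    for kh in range(k):
                        for kw in range(k):
                            # Each input pixel scatters a KxK patch into the output.
                            oh = stride * ih + kh - padding
                            ow = stride * iw + kw - padding
                            # Positions in the removed border are skipped.
                            if 0 <= oh < n_oh and 0 <= ow < n_ow:
                                out[oc][oh][ow] += (
                                    inp[ic][ih][iw] * weight[ic][oc][kh][kw]
                                )
    return out
```

Note that neighboring input pixels accumulate into the same output positions whenever K is larger than S, which is the overlapping-sum behavior discussed above.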
A. Reverse Looping
To avoid the overlapping sum problem, we propose a technique called reverse looping: instead of directly deconvolving the input space, we use the output space to determine which input blocks to deconvolve, thus eliminating the need for the additional summation operations described above. This procedure is illustrated in Fig. 6.
Fig. 6. An efficient way to deconvolve. We first take a block in the output space and determine which inputs are needed to calculate the values in the block. Then, for each block, the input is deconvolved and the appropriate output is extracted. This is done sequentially until values have been computed for the entire output space.
The loop iterations over $i_h$ and $i_w$ in the CPU implementation shown in Algorithm 1 need to be recast over $o_h$ and $o_w$. Referring to Algorithm 1 and Fig. 4, we have:
$$o_h = i_h \times S + k_h - P \tag{4}$$
Rearranging terms, we get:
$$i_h = \frac{o_h + P - k_h}{S} \tag{5}$$
Unfortunately, Eq. 5 generally results in a non-integer value for the loop variable $i_h$, which is invalid [19]. One way to address this problem would be to monitor $i_h$ so that fractional values can be discarded. However, this would consume additional hardware resources and create unnecessary latencies in the system.
B. Stride Hole Skipping
In this section, we propose a technique called stride hole skipping to ensure that $i_h$ in Eq. 5 is an integer. Toward this end, we recast $o_h$ in terms of two new variables, $o'_h$ and $f_h$, and show that this leads to an effective way of solving the aforementioned problem. First note that a sufficient condition for $i_h$ to be an integer in Eq. 5 is:
$$(o_h + P - k_h) \bmod S = 0 \tag{6}$$
Assuming $O_H / S$ is an integer ($O_H$ is defined in Eq. 1), we can recast $o_h$ as follows:
$$o_h = S \times o'_h + f_h, \qquad o'_h \in \{0, 1, \dots, O_H/S - 1\}, \quad f_h \in \{0, 1, \dots, S - 1\} \tag{7}$$
Using the definition of $o_h$ in Eq. 7, we can recast the sufficient condition Eq. 6 in terms of $f_h$ as below:
$$(f_h + P - k_h) \bmod S = 0 \tag{8}$$
Eq. 8 implies that we can rewrite $f_h$ as:
$$f_h = S - ((P - k_h) \bmod S) \tag{9}$$
This can be verified by plugging Eq. 9 into Eq. 8, which yields the following identity:
$$(S - ((P - k_h) \bmod S) + P - k_h) \bmod S = 0 \tag{10}$$
To prevent $f_h$ from taking a value equal to $S$, we enforce the additional condition:
$$f_h = (S - ((P - k_h) \bmod S)) \bmod S \tag{11}$$
By using Eq. 11 to choose values for $f_h$, we can ensure that $o_h$ computed from Eq. 7 meets the condition in Eq. 6. Therefore we can avoid the previously mentioned issue of discarding fractional values of $i_h$ that we would otherwise encounter from a direct application of Eq. 5. The pseudo code for deconvolution on FPGA is shown in Algorithm 2.
Algorithm 2 Our FPGA Implementation of Deconvolution
procedure REVERSEDECONVOLUTION
  ...
  for o_h = 0 to ... do
    for o_w = 0 to ... do
      for o_c = 0 to ... do
        for i_c = 0 to ... do
          ...
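To make reverse looping and stride hole skipping concrete, the following Python sketch computes a deconvolution by iterating over output positions only. It is an illustration under stated assumptions rather than a transcription of Algorithm 2: the loop order, the helper names `stride_offset` and `deconv_reverse`, the nested-list layout, and the weight ordering `[I_C][O_C][K][K]` are all hypothetical. The offset formula is the one the text describes for choosing $f_h$, and the output index is stepped by S so that every visited position maps to an integer input index.

```python
def stride_offset(stride, padding, k_idx):
    """Offset f: first output index whose matching input index is an integer."""
    return (stride - (padding - k_idx) % stride) % stride


def deconv_reverse(inp, weight, stride, padding):
    """Output-driven deconvolution with stride hole skipping (hypothetical helper).

    inp:    nested list indexed [I_C][I_H][I_W]
    weight: nested list indexed [I_C][O_C][K][K] (layout is an assumption)
    Returns a nested list indexed [O_C][O_H][O_W].
    """
    n_ic, n_ih, n_iw = len(inp), len(inp[0]), len(inp[0][0])
    n_oc, k = len(weight[0]), len(weight[0][0])
    n_oh = stride * (n_ih - 1) + k - 2 * padding
    n_ow = stride * (n_iw - 1) + k - 2 * padding
    out = [[[0.0] * n_ow for _ in range(n_oh)] for _ in range(n_oc)]
    for kh in range(k):
        fh = stride_offset(stride, padding, kh)
        for kw in range(k):
            fw = stride_offset(stride, padding, kw)
            # Visit only o = S*o' + f, so (o + P - k) is always divisible by S.
            for oh in range(fh, n_oh, stride):
                ih = (oh + padding - kh) // stride  # exact division
                if not 0 <= ih < n_ih:
                    continue  # outside the input feature map
                for ow in range(fw, n_ow, stride):
                    iw = (ow + padding - kw) // stride
                    if not 0 <= iw < n_iw:
                        continue
                    for oc in range(n_oc):
                        for ic in range(n_ic):
                            out[oc][oh][ow] += (
                                inp[ic][ih][iw] * weight[ic][oc][kh][kw]
                            )
    return out
```

Because each output element is written only from within its own loop iteration, no summation across separately computed output blocks is needed, and no fractional input index is ever generated or tested for.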
IV. THREE-STEP DESIGN METHODOLOGY

A. Statistical Analysis
It is important to study the effect of bitwidth reduction on the quality of inference from the generative model. To find the most cost-efficient bitwidth for DCNNs, we fix $T_{OH}$, $T_{OW}$, $T_{OC}$, $T_{IC}$ and study the trade-off between generative quality and implementation complexity over a range of bitwidths using statistical analysis. Quantifying the quality of generative models using traditional techniques such as Kullback-Leibler divergence and log-likelihood is not feasible in high-dimensional settings such as those in which deconvolutional neural networks are typically used. To overcome this drawback, we apply nonparametric goodness-of-fit testing. Specifically, we apply the Relative Maximum Mean Discrepancy (RMMD) test proposed in [20] to measure and compare the performance of our system at different bitwidths.
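The RMMD is an extension of the Maximum Mean Discrepancy (MMD) two-sample test proposed in [21]. Given samples $\{X_i\}_{i=1}^{m}$ and $\{Y_i\}_{i=1}^{n}$ from distributions $P_x$ and $P_y$, the MMD test statistic is given by:
$$\mathrm{MMD}_u^2(X, Y) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i}^{m} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i}^{n} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)$$
The null hypothesis $H_0: P_x = P_y$ is tested versus the alternative $H_1: P_x \neq P_y$. In the above equation, $k$ is the Radial Basis Function with bandwidth $\sigma$, given by
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$
The RMMD test builds upon the standard MMD framework by computing the MMD test statistic between two pairs of distributions. Given samples $\{X_i\}_{i=1}^{m}$, $\{Y_i\}_{i=1}^{n}$, and $\{Z_i\}_{i=1}^{r}$, respectively from the training data, low-bitwidth DCNN, and full-precision DCNN, RMMD tests the null hypothesis $H_0: \mathrm{MMD}(P_x, P_y) \le \mathrm{MMD}(P_x, P_z)$ against the alternative $H_1: \mathrm{MMD}(P_x, P_y) > \mathrm{MMD}(P_x, P_z)$. Reference [20] shows that the p-values for testing $H_0$ against $H_1$ are given by:
$$p \le \Phi\left(-\frac{\mathrm{MMD}_u^2(X, Y) - \mathrm{MMD}_u^2(X, Z)}{\sqrt{\sigma_{XY}^2 + \sigma_{XZ}^2 - 2\sigma_{XYXZ}}}\right)$$
where Φ is the Normal Cumulative Distribution Function, and $\sigma_{XY}^2$, $\sigma_{XZ}^2$, and $\sigma_{XYXZ}$ denote the variances and the covariance of the two MMD estimates. The p-value in the above equation indicates the probability that, based on the observed samples, the distribution based on the low bitwidth DCNN is closer to the training data than the distribution based on the full precision DCNN is to the training data. Using this interpretation: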
• a p-value > 0.5 indicates the low bitwidth DCNN is more similar to the training data
• a p-value < 0.5 indicates the full precision DCNN is more similar to the training data
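As a rough illustration of the quantity being compared, the sketch below computes the standard unbiased squared-MMD estimate with an RBF kernel in plain Python. It is not the full RMMD test: the variance and covariance terms needed for the p-value of [20] are omitted, and the function names, the default bandwidth `sigma=1.0`, and the list-of-vectors sample format are assumptions made for the example.

```python
import math


def rbf_kernel(x, y, sigma):
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2)) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigma=1.0):
    """Unbiased squared-MMD estimate between two sample sets (lists of vectors)."""
    m, n = len(xs), len(ys)
    # Within-sample terms exclude i == j, which makes the estimate unbiased.
    k_xx = sum(rbf_kernel(xs[i], xs[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(rbf_kernel(ys[i], ys[j], sigma)
               for i in range(n) for j in range(n) if i != j)
    k_xy = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)
```

With training samples `train`, low-bitwidth outputs `low`, and full-precision outputs `full` (all hypothetical names), comparing `mmd2_unbiased(train, low)` against `mmd2_unbiased(train, full)` gives the sign of the numerator inside Φ; a smaller value for `low` corresponds to a p-value above 0.5.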
B. Roofline Analysis
The generative quality is determined by choosing the optimal bitwidth using the previously described procedure. Following this, we turn to further increasing the throughput by optimizing with respect to $T_{OH}$, $T_{OW}$, $T_{OC}$, and $T_{IC}$, which are the height, width, and channel size of the output block and the channel size of the input block, respectively (see Fig. 5). This is done using roofline analysis [6]. Fig. 7 shows an example roofline plot, where the X axis denotes the number of operations per memory access and the Y axis denotes the number of operations per cycle.
Fig. 7. Roofline model, adapted from [6]. In this drawing, A, B and C correspond to accelerator designs with different values of $T_{OH}$, $T_{OW}$, $T_{OC}$, and $T_{IC}$.
Design A transfers too much data, so its computation speed is low and it falls well beneath the computation roof. Design B lies well beneath the bandwidth roof, which means the system performance is dominated by memory transfers. Design C is more efficient than A and B because it balances computation speed and memory bandwidth. This technique is described in [6], where it is used to design a convolution accelerator. We apply roofline analysis to the design of a deconvolution accelerator and estimate the computation to communication ratio (CTC) and computational roof (CR) for a given layer.
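The roofline bound itself is simple to state in code. The sketch below assumes the usual roofline relation, attainable performance = min(computational roof, CTC × memory bandwidth), and picks the best of several candidate designs. The dictionary keys, the function names, the tie-breaking rule (prefer the higher CTC, which needs less bandwidth), and all numbers are illustrative assumptions, not values from the paper.

```python
def attainable_performance(ctc, comp_roof, bandwidth):
    """Roofline bound: min(computational roof, CTC * memory bandwidth).

    ctc:       operations per unit of external memory traffic
    comp_roof: operations per cycle the design can sustain
    bandwidth: memory traffic units per cycle
    """
    return min(comp_roof, ctc * bandwidth)


def best_design(designs, bandwidth):
    """Return the candidate with the highest attainable performance.

    Ties are broken in favor of the higher CTC (an assumed policy).
    """
    return max(
        designs,
        key=lambda d: (attainable_performance(d["ctc"], d["cr"], bandwidth), d["ctc"]),
    )


# Hypothetical candidates resembling designs A, B and C in the roofline plot.
candidates = [
    {"name": "A", "ctc": 2, "cr": 100},   # limited by memory traffic
    {"name": "B", "ctc": 50, "cr": 20},   # limited by its computational roof
    {"name": "C", "ctc": 20, "cr": 80},   # balanced
]
chosen = best_design(candidates, bandwidth=4)
```

In a design space exploration, `candidates` would be generated by enumerating tile sizes and evaluating the CTC and CR expressions for each layer.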
1) Computation to Communication Ratio:
Let $\alpha_{in}$, $\alpha_w$, $\alpha_{out}$ and $B_{in}$, $B_w$, $B_{out}$ denote the trip counts and buffer sizes of memory accesses to the input feature maps, weights, and output feature maps, respectively. The CTC is given by:
$$\mathrm{CTC} = \frac{\text{total number of operations}}{\text{total amount of external memory access}}$$
2) Computation Roof: Let PD denote the pipeline depth and II the number of cycles between the start of consecutive iterations of the loop $T_{OW}$. The CR is given by:
$$\mathrm{CR} = \frac{\text{total number of operations}}{\text{number of execution cycles}}$$
where
will not hold true when the bitwidth is greater than 18, because the maximum bitwidth of the multipliers used in our implementation is 18-bit [22] . Since we use a bitwidth of 12 in all our experiments this constraint is therefore valid.
C. VLSI Level Optimization
1) Loop Unrolling and Pipelining: Loop unrolling is a key technique of high-level synthesis [23]. It works by generating parallel hardware to accelerate FPGA program execution. The innermost loops over $T_{OC}$ and $T_{IC}$ in Algorithm 2 are unrolled and can be executed in a constant number of cycles P, which forms the processing engine shown in Fig. 8. We also pipeline the loop $T_{OW}$ with a carried dependency of 2.
2) Register Insertion: The critical path length and pipeline interval are constrained by the on-chip local memory bandwidth, especially when the processing engine is large. To further improve performance, we insert registers to economize local memory bandwidth, as illustrated in Fig. 9.
V. EXPERIMENTAL RESULTS
A. Statistical Analysis
Previous work such as that described in [24] has shown the effectiveness of using high-dimensional nonparametric tests to determine optimal parameters for generative inference in hardware. For designing the deconvolution accelerator we follow a similar approach and use the RMMD test framework outlined in Section IV A to choose the optimal bitwidth for our system. For this purpose, we trained two DCNNs through the method described in [17] on the MNIST and CelebA Human Face datasets [25] . To study the trade-off between generative quality and system complexity over a range of bitwidths, we determine p-value × minimum slack and p-value/power as a function of bitwidths. The two curves are shown in Fig. 10 . Both curves peak at bitwidth 12, which we take to be a good choice because it represents a high p-value (generative quality) with a low power consumption and high minimum slack. 
B. Hardware System
We implemented the deconvolution accelerator IP with Vivado HLS (v2016.2). We use ap_fixed.h from the Vivado Math Library to implement fixed-point arithmetic operations with arbitrary bitwidth precision, and use hls_stream.h and ap_axi_sdata.h to model the streaming data structure. The hardware system is built on a Zynq-7000 XC7Z020 FPGA with the Vivado Design Suite and Xilinx SDK. The 7Z020 FPGA is programmed with our accelerator IP, and the ARM processor is used to initialize the accelerator, set parameters, and transfer data for each layer. An overview of the implementation block diagram is shown in Fig. 11.
C. Experimental Results
Fig. 12 shows some generated faces and digits from our trained DCNNs. Fig. 13 shows the output of DCNNs under different bitwidths for the same input. Visually evaluating degradation of image quality is only feasible in cases of extremely low bitwidth such as 8 bits. Our proposed methodology provides an analytical framework for quantifying the trade-off between image quality and implementation complexity over a range of bitwidths.
shown as located at the left corner of the roof.
Table I shows the utilization rate after place and route, and we compare our DCNN performance with some existing CNN accelerators for reference in Table II. The performance can be further improved by implementing a ping-pong buffer in our system.
VI. CONCLUSION
In this work, we develop an FPGA-based deconvolution accelerator for deconvolutional neural networks and propose a three-step design methodology that first uses statistical analysis to find the most cost-efficient bitwidth, then explores the design space with the roofline model [6], and finally uses VLSI optimization methods to produce the final design. We implement our method on a Zynq-7000 FPGA and realize a performance density of 0.012 GOPs/DSP.
