In recent years, FPGA-based CNN accelerators have been proposed for optimizing performance and power efficiency. Most accelerators are designed for object detection and recognition algorithms that are performed on low-resolution images. However, real-time image super-resolution (SR) cannot be implemented on a typical accelerator because of the long execution cycles required to generate high-resolution (HR) images, such as those used in ultra-high-definition systems. In this paper, we propose a novel CNN accelerator with efficient parallelization methods for SR applications. First, we propose a new methodology for optimizing the deconvolutional neural networks (DCNNs) used for increasing feature maps. Second, we propose a novel method to optimize CNN dataflow so that the SR algorithm can be driven at low power in display applications. Finally, we quantize and compress a DCNN-based SR algorithm into an optimal model for efficient inference using on-chip memory. We present an energy-efficient architecture for SR and validate our architecture on a mobile panel with quad-high-definition resolution. Our experimental results show that, with the same hardware resources, the proposed DCNN accelerator achieves a throughput up to 108 times greater than that of a conventional DCNN accelerator. In addition, our SR system achieves an energy efficiency of 144.9, 293.0, and 500.2 GOPS/W at SR scale factors of 2, 3, and 4, respectively. Furthermore, we demonstrate that our system can restore HR images to a high quality while greatly reducing the data bit-width and the number of parameters compared with conventional SR algorithms.
I. INTRODUCTION
Recently, object detection [1]-[3], recognition [4]-[6], and natural language processing [7] have attracted considerable attention because of the emergence of convolutional neural networks (CNNs). As a result, extensive studies on CNN accelerators have been conducted in order to implement CNN algorithms in real-time systems. In particular, in hardware implementation, FPGA-based CNN accelerators are more energy-efficient than those based on GPUs and can perform more massive parallel processing than those based on CPUs [8]. In addition, compared to ASICs, FPGAs are flexible enough to handle the rapid evolution of CNNs [9], [10].

Fig. 1. Comparison of the computational complexity between AlexNet, composed of five convolutional layers, and FSRCNN, composed of seven convolutional layers and one deconvolutional layer, where the convolutional layers are C1 to C7 and the deconvolutional layer is DC1.
Most CNN accelerator-related studies [8] - [22] have been focused on object detection and recognition applications. In recent years, research studies on image super-resolution (SR) using CNNs have been attracting considerable attention, because CNN-based methods can reconstruct images with a higher peak signal-to-noise ratio (PSNR) than conventional methods [23] - [25] . However, since most CNN accelerators are designed for object detection and recognition algorithms, the following problems can occur when SR algorithms are implemented in a typical accelerator.
First, SR requires a considerably higher-resolution input image than object detection and recognition algorithms in order to generate full-high-definition (FHD), quad-high-definition (QHD), and ultra-high-definition (UHD) videos for mobile applications or TV services. Fig. 1 shows a computational complexity comparison between AlexNet [4] and FSRCNN [25], which is a well-known deep neural network (DNN)-based SR algorithm. Most object classifiers operate on input images with a pixel resolution of less than 256×256 [4]-[6]. Since the resolution used in object classifiers is lower than that used in SR, FSRCNN requires 38.82 times more multiply-accumulate (MAC) operations than AlexNet when generating UHD images.

Fig. 2. Three levels of the general hardware DCNN accelerator hierarchy: 1) off-chip memory; 2) on-chip memory; 3) processing elements (PEs). Unlike a CNN accelerator, a DCNN accelerator incurs an overhead to read the outputs stored in the off-chip memory because of the overlapping sum problem. In this figure, the red arrows indicate additional read operations.
Second, recent DNN-based SR algorithms, including FSRCNN, use deconvolutional neural networks (DCNNs) [26] at the end of the entire network to reconstruct high-resolution (HR) images from low-resolution (LR) images. The deconvolutional layer has the highest computational complexity; it uses a maximum of 6.75 times more MAC operations than the convolutional layers, as shown in Fig. 1. Moreover, unlike CNNs, DCNNs create up-scaled output blocks in terms of the kernel size and accumulate pixel values within the output blocks generated from the neighboring pixels.
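The 6.75 figure can be checked with a small per-pixel MAC count. The sketch below assumes the usual FSRCNN(56, 12, 4) layer layout from [25] (the layer list is not spelled out in this section) and counts $K \times K \times M \times N$ MACs per LR pixel position for every layer; the names `FSRCNN_56_12_4`, `macs_per_input_pixel`, and `deconv_to_conv_ratios` are illustrative only.

```python
# Layers written as (kind, K, M, N): kernel size K, M output maps, N input maps.
# Assumption: the standard FSRCNN(56, 12, 4) layout; not taken from this section.
FSRCNN_56_12_4 = (
    [("conv", 5, 56, 1), ("conv", 1, 12, 56)]
    + [("conv", 3, 12, 12)] * 4
    + [("conv", 1, 56, 12), ("deconv", 9, 1, 56)]
)


def macs_per_input_pixel(layer):
    """MACs spent per LR pixel position for one layer: K * K * M * N."""
    _, k, m, n = layer
    return k * k * m * n


def deconv_to_conv_ratios(layers):
    """Ratio of the deconvolutional layer's MACs to each convolutional layer's."""
    deconv = next(layer for layer in layers if layer[0] == "deconv")
    d = macs_per_input_pixel(deconv)
    return [d / macs_per_input_pixel(layer) for layer in layers if layer[0] == "conv"]


ratios = deconv_to_conv_ratios(FSRCNN_56_12_4)
print(max(ratios))  # 6.75, reached against the 1x1 convolutional layers
```

Under these assumptions the deconvolutional layer costs 4536 MACs per LR pixel, against 672 for each 1 × 1 layer, which reproduces the stated maximum ratio.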
As shown in Fig. 2, when DCNNs are implemented in hardware, additional operations that are not needed in CNNs are required to load the previously obtained output pixels from the memory, update the pixel values in the processing elements (PEs), and store them in the memory again. This is called the overlapping sum problem [27], [28].
The conventional DCNN accelerator [27] attempts to solve this problem by using formulas to locate the positions of the input pixels needed to generate the output pixel through a reverse looping method. A deconvolutional layer processor (DCLP) was designed to perform parallel operations based on the size of tile parameters by applying loop optimization techniques [10] that remove the dependence of loops. However, the reverse looping method, which requires additional loads before each PE, has a large hardware overhead due to limited resources and is not energy efficient. Additionally, the conventional DCNN accelerator does not optimize the high computational complexity of the deconvolutional layer.
Another method generates HR images from LR-sized feature maps through a sub-pixel convolutional layer [29]. This layer performs the same operation as the convolutional layer but combines the LR-sized output feature maps into an HR image. However, since zero-weights do not exist in its convolution filters, a dense CNN accelerator is required. Therefore, the sub-pixel convolutional layer is inefficient in high-complexity SR applications.
In general, when implemented with on-chip memory, it is difficult to store all the data required for CNN-based algorithms, except with binarized feature maps [19] ; this is because of the large size of the 3D feature maps obtained each time the convolutional layer is processed. Therefore, most FPGA-based CNN accelerators use an off-chip memory and perform off-chip data transfer and computation simultaneously through ping-pong operations. As a result, MAC operations can be performed continuously through a convolutional layer processor (CLP) using loop optimization techniques. Even if ping-pong operations are applied through double buffers, a CLP cannot perform the subsequent operation until a large number of output feature maps are stored in the off-chip memory. In this case, the performance of the accelerators is degraded [17] , [18] .
In previous papers, CNN fusion architectures were proposed for reducing the large amount of off-chip data transfers [17], [18]. Fusion architectures are designed with various CLPs for processing multiple convolutional layers. Therefore, the data generated by each CLP are transferred to the next layer processor using the on-chip memory. The off-chip data transfer only occurs on the first and last fused layers. However, CNN fusion architectures still require communication with off-chip memory, which is not energy-efficient.
In order to reduce the bandwidth between the accelerator and off-chip memory, Brainwave [30] stores the DNN model and intermediate data in on-chip storage to simplify the off-chip interconnect. This requires quantization and compression of the DNN model. However, Brainwave is limited in that it does not support DCNNs.
In this paper, we propose a novel SR-based DNN accelerator for real-time HR image generation with efficient dataflow. The main contributions of this paper are as follows.
• We propose a novel DCNN accelerator that can be massively parallelized by transforming the deconvolutional layer into the convolutional layer (the TDC method). We identified a load imbalance problem during the convolution process executed by the TDC method in our previous work [28]. To overcome this problem, we propose a new load balance-aware TDC method that increases the efficiency of sparse matrix multiplication.
• We propose a dataflow for hardware acceleration to store the intermediate data between the layers using the on-chip memory.
• We quantize and compress a representative DCNN-based SR algorithm, called FSRCNN, into an optimal model for efficient inference using on-chip memory. The same optimization process can be applied to other SR algorithms so that they can be implemented with on-chip memory.
• We present an energy-efficient DNN-based SR system. Our system achieves an energy efficiency of 144.9 GOPS/W, 293.0 GOPS/W, and 500.2 GOPS/W for SR scale factors of 2, 3, and 4, respectively.
The rest of this paper is organized as follows. Section II gives an overview of the CNN and DCNN algorithms. Section III describes the proposed methodology for the DCNN accelerator. Section IV presents the proposed hardware architecture for SR systems and the details of the hardware implementation. Section V presents the experimental results and comparisons.
II. BACKGROUND
Table I shows the parameter notations used in the convolutional and deconvolutional layers. Table II shows key abbreviations used in this paper.
A. Convolutional Neural Networks
Fig. 3(a) shows the convolutional layer constituting the CNN structure. The convolutional layer receives input feature maps, which are arranged in three dimensions, $H_{in} \times W_{in} \times N$. Then, it creates output feature maps, which are the results obtained from the input feature maps using learned weights. The process of generating output feature maps is as follows. First, input blocks that move by a defined stride in the input feature maps are convolved with the weights. The kernel size is $K_C \times K_C$, the number of kernels is $M \times N$, and the stride is $S$. To create $M$ output feature maps, all the $N$ outputs generated by the same type of convolution filter are added together with biases. Finally, the activation function [31] transforms the outputs of the three-dimensional convolution.
B. Deconvolutional Neural Networks
Fig. 3(b) illustrates the deconvolutional layer that comprises the DCNN. The deconvolutional layer moves the sliding window at stride intervals in the output feature maps rather than in the input feature maps. The size of each output block is $K_D \times K_D$. Therefore, as shown in Fig. 3(b), the overlapping sum problem, where the output blocks overlap with the neighboring output blocks, occurs in the green and blue regions. The outputs located in the green region can be easily updated using on-chip buffers because they do not overlap vertically with adjacent output blocks.
On the other hand, the outputs located in the blue region must be read from the memory whenever they overlap with vertically adjacent output blocks. Consequently, it is difficult to store the large amount of intermediate data needed to update the previously generated outputs in on-chip memory. Until the final outputs no longer overlap with neighboring blocks, the processor must read the output that is already stored in memory, update it, and store it again. This inefficient dataflow interferes with the ping-pong operation, which overlaps the computation of the processor with the data transfer time [10].
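The overlapping sum problem can be made concrete with a single-channel deconvolution written as a scatter-accumulate loop. This is a minimal NumPy sketch, not the accelerator's dataflow; `deconv2d_scatter` and the `writes` counter are illustrative names, and no output padding or cropping is applied.

```python
import numpy as np


def deconv2d_scatter(x, w, stride):
    """Single-channel deconvolution: every input pixel adds a K x K block."""
    h, wd = x.shape
    k = w.shape[0]
    out = np.zeros((stride * (h - 1) + k, stride * (wd - 1) + k))
    writes = np.zeros_like(out, dtype=int)  # how often each output is updated
    for i in range(h):
        for j in range(wd):
            ys, xs = stride * i, stride * j
            out[ys:ys + k, xs:xs + k] += x[i, j] * w  # read-modify-write
            writes[ys:ys + k, xs:xs + k] += 1
    return out, writes


out, writes = deconv2d_scatter(np.ones((3, 3)), np.ones((3, 3)), stride=2)
print(writes)
```

Every entry of `writes` larger than 1 marks an output pixel that has to be read back and updated after it was first produced, which is the extra traffic indicated by the red arrows in Fig. 2.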
III. PROPOSED DCNN ACCELERATOR
A. TDC Method
Each pixel in the input feature map generates an output block of size $K_D \times K_D$ through deconvolution. However, this block overlaps with the output blocks generated from neighboring input pixels. Fig. 4 shows examples where output blocks generated from adjacent inputs overlap with each other. We must add all the overlapping areas every time the input pixels perform 2D deconvolution. To avoid this overlapping sum problem, we must determine the number of input pixels required to generate an output block that no longer overlaps.
Each output block can overlap with adjacent output blocks within a range of $K_D/2$, as shown in Fig. 4. Since the value of the stride $S$ is always greater than 1 for up-scaling, the input pixels that are mapped to the output feature map, which are depicted as red bounding boxes, are spaced apart by $S$. Thus, the value $N_O$, which indicates how many horizontal (or vertical) neighboring blocks overlap within $K_D/2$, can be calculated as
The fractional value of $N_O$ determines how the current output block overlaps with the most distant output block. Fig. 4 shows a comparison of cases where the integer values are the same but the fractional values differ. As shown in Fig. 4(a), when the fractional value of $N_O$ is less than 0.5, the top leftmost output block overlaps two neighboring output blocks within $K_D/2$, but does not overlap with the adjacent third output block on the same line. Conversely, if the fractional value is greater than 0.5, the top leftmost output block overlaps all three neighboring output blocks, as shown in Fig. 4(c). Considering both possible cases, the size of the input block $K_C \times K_C$ that can produce an output that does not overlap with the adjacent output block can be determined as
In the deconvolutional layer, the output blocks are spaced apart by $S$. The computation process of producing each output pixel consists of MAC operations between input pixels and weight coefficients of the deconvolution filter. In this process, there is a new source of massive parallelism because there is no data dependency in creating output pixels. Specifically, each output pixel can be generated from the convolution between a $K_C \times K_C$ input block and a convolution filter of the same size as the input block. As shown in Fig. 5, we apply this new source of parallelism in the hardware implementation. Through the TDC method, the pixels in the $S \times S$ output block can all be created on the same timeline. Specifically, we convert the spatial domain in the HR output feature map so that each pixel in the $S \times S$ output block is generated separately in a different channel. This increases the number of output feature maps by a factor of $S^2$. Likewise, we apply the TDC method to all feature maps.
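The idea that a stride-$S$ deconvolution can be replaced by $S^2$ independent convolutions on the LR input can be checked numerically. The sketch below uses a standard polyphase split of the deconvolution kernel, which is an assumption standing in for the paper's own mapping (its equations are not reproduced here); it handles one channel, applies no output cropping, and the function names are illustrative. Phases with fewer taps correspond to filters that would be padded with zero-weights to a common size.

```python
import numpy as np


def deconv2d_scatter(x, w, s):
    """Reference: stride-s deconvolution with overlapping accumulation."""
    h, wd = x.shape
    k = w.shape[0]
    out = np.zeros((s * (h - 1) + k, s * (wd - 1) + k))
    for i in range(h):
        for j in range(wd):
            out[s * i:s * i + k, s * j:s * j + k] += x[i, j] * w
    return out


def deconv2d_as_convs(x, w, s):
    """Same result from s * s independent convolutions on the LR input."""
    h, wd = x.shape
    k = w.shape[0]
    out = np.zeros((s * (h - 1) + k, s * (wd - 1) + k))
    for py in range(s):
        for px in range(s):
            sub = w[py::s, px::s]  # weights that feed this output phase
            ty, tx = sub.shape
            xp = np.pad(x, ((ty - 1, ty - 1), (tx - 1, tx - 1)))
            phase = np.zeros((h + ty - 1, wd + tx - 1))
            # Gather form: each output reads inputs only, no read-modify-write
            # of previously produced outputs.
            for u in range(ty):
                for v in range(tx):
                    ys, xs = ty - 1 - u, tx - 1 - v
                    phase += sub[u, v] * xp[ys:ys + h + ty - 1, xs:xs + wd + tx - 1]
            out[py::s, px::s] = phase  # one LR-sized "channel" per phase
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
w = rng.standard_normal((9, 9))
print(np.allclose(deconv2d_scatter(x, w, 2), deconv2d_as_convs(x, w, 2)))
```

Because each phase is an ordinary convolution, the $S^2$ phases can be computed concurrently and written to separate output channels, which is the property the TDC method exploits.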
B. Inverse Coefficient Mapping
We now describe how the weights of the convolutional layer newly created through the TDC method are obtained. We declare $(x_i, y_i)$, $(x_d, y_d)$, and $(x_o, y_o)$ as the indices of an input pixel, a weight coefficient of a deconvolutional layer, and an output pixel, respectively. The range of each index is defined as
To map the weights of the deconvolutional layer to those of the convolutional layer, we propose an inverse coefficient mapping that finds the $(x_d, y_d)$ corresponding to $(x_i, y_i)$. Fig. 6 shows an example of the inverse coefficient mapping. We must find the indices of the weight coefficient corresponding to the input pixel, which is shown as a red rectangle, to produce the output pixel represented by a green bounding box. The overall process for inverse coefficient mapping is as follows.
As shown in Fig. 7, we divide the inverse coefficient mapping into two processes. First, we obtain the relative position $(x_r, y_r)$ using Eq. (4). This is because input pixels are shifted by the stride $S$ in the output feature map to produce output blocks. However, since two types of output blocks are created according to the fractional value of $N_O$, their relative position depends on this value. Fig. 7(a) shows that, if the fractional value of $N_O$ is less than 0.5, the relative position is point A. In contrast, Fig. 7(b) shows the case where the fractional value of $N_O$ is greater than 0.5, in which the relative position differs.

Fig. 6. Example of inverse coefficient mapping. By the inverse coefficient mapping, we obtain the indices of the weight coefficient when given the input and output pixels located in the red square and the green bounding box, respectively. The blue bounding box represents the $S \times S$ output block.
Next, we subtract the offset to select the weight coefficient for one of the output pixels as shown in Fig. 7 . We calculate the indices of the weight coefficient corresponding to the input pixel as
Finally, $W_D$, which represents the weights of the deconvolutional layer, is mapped to the weights of the newly created convolutional layer $W_C$ using Eq. (4) and Eq. (5) as follows.
where $m$ and $n$ are the loop indices of the output and input feature maps with ranges of $1 \le m \le M$ and $1 \le n \le N$, respectively. However, if $(x_d, y_d)$ exceeds the range of the indices based on Eq. (3), the weight coefficient becomes zero, thereby producing a zero-valued element, which will be explained in the next sub-section. Therefore, the $M \times N$ deconvolution filters are classified into $S^2$ kinds of convolution filters according to the indices of the output pixels, as shown in Fig. 8(a). Since $S$ is set to 2 in Fig. 8(a), there are four kinds of sparse convolution filters.
C. Zero-Aware Processing Element
Our TDC method maps one deconvolution filter of size $K_D \times K_D$ to $S^2$ convolution filters of size $K_C \times K_C$ in each input and output feature map according to Eq. (6). However, the effective sizes of these filters differ because some of their weights are filled with zero-valued elements. The total number of zero-valued elements, $num_{zero}$, in the transformed convolution kernels is derived as
Table III shows the ratio of zero-weights in the convolutional layer generated by the TDC method. The ratio varies according to $K_D$ and $S$. Moreover, it can be seen that the $K_C$ obtained from the TDC method is always smaller than $K_D$.
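A rough count of the zero-weights follows if one assumes that every deconvolution weight lands in exactly one of the $S^2$ transformed filters, so that the remaining positions are zero. This is an assumption made for illustration and is not a reproduction of the paper's own expression or of Table III; `tdc_zero_weights` is a hypothetical helper, and $K_C$ is passed in rather than derived.

```python
def tdc_zero_weights(k_d, s, k_c):
    """Zero-weight count and ratio for S^2 filters of size K_C x K_C.

    Assumes each of the K_D * K_D deconvolution weights appears exactly once
    across the transformed filters (illustrative assumption).
    """
    total = s * s * k_c * k_c
    nonzero = k_d * k_d
    if nonzero > total:
        raise ValueError("K_C is too small to hold all deconvolution weights")
    zeros = total - nonzero
    return zeros, zeros / total


print(tdc_zero_weights(9, 2, 5))  # (19, 0.19)
```

For the configuration used later in this section ($K_D = 9$, $K_C = 5$, $S = 2$), this gives 19 zero-weights out of 100 positions under the stated assumption.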
The TDC method efficiently creates output pixels by reducing the kernel size of the weights. However, a load imbalance problem occurs because of the zero-weights, as demonstrated by Table III. This is because the distribution of $W_C$ is different for each output pixel.

Fig. 8 shows the dataflow of the proposed DCLP. Before the DCNN inference, we first convert the deconvolution filter to convolution filters using the TDC method, as shown in Fig. 8(a). We then apply load balancing to adjust the proportion of zero-weights within each filter. To produce the same result as before, we store the output index, which is the address of the output buffer, in memory along with the weights. All of the above steps are conducted offline. Next, we minimize the execution cycles by evenly distributing the non-zero weights across the PEs, which reduces the total idle cycles of the PEs.

Fig. 8(b) shows the DCLP architecture with sparse input activations and weights. Since the positions of the zero-weights are the same in all the filters, the positions of the corresponding zero inputs are likewise fixed. As a result, our proposed DCLP exploits both input activation and weight sparsity to balance the load. Finally, we parallelize the input and output feature maps using the loop optimization techniques to make a comparison with the conventional DCNN accelerator in the same environment. The DCLP performs $T_m \times T_n$ MAC operations in parallel on inputs and weights, where $T_m$ and $T_n$ are the tile sizes used to parallelize over the output and input feature maps, respectively.
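The benefit of spreading the non-zero weights evenly over the PEs can be illustrated with a simple cycle model. The sketch assumes one MAC per PE per cycle and takes the per-filter non-zero counts from a polyphase split of the kernel; both are assumptions of this sketch, so the counts need not match the ones drawn in the paper's figures. The function names are illustrative.

```python
from math import ceil


def phase_nonzeros(k_d, s):
    """Non-zero weights in each of the S^2 filters (polyphase assumption)."""
    taps = [ceil((k_d - p) / s) for p in range(s)]
    return [ty * tx for ty in taps for tx in taps]


def cycles_unbalanced(counts):
    """One filter per PE: the busiest PE sets the pipeline length."""
    return max(counts)


def cycles_balanced(counts, num_pe):
    """Non-zero weights spread evenly over the PEs (done offline)."""
    return ceil(sum(counts) / num_pe)


counts = phase_nonzeros(9, 2)
print(counts)                      # [25, 20, 20, 16]
print(cycles_unbalanced(counts))   # 25
print(cycles_balanced(counts, 4))  # 21
```

In this toy model the balanced schedule needs 21 cycles instead of 25 for four PEs. Because reordered weights no longer sit next to the output they belong to, each weight has to carry an output index, as described above.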
Until the operation of each filter is completed, the intermediate outputs must be accumulated with the previous outputs stored in the buffers. Hence, the results of the reordered weights are accumulated in the output buffers via the output index, as shown in Fig. 8(b).

Fig. 9 shows the performance benefits achieved through the proposed load balance-aware TDC method over other methods. In this case ($K_D = 9$, $K_C = 5$, $S = 2$), we used four PEs. Fig. 9(a) shows the performance of the conventional DCNN accelerator. Fig. 9(b) shows the performance degradation of each PE caused by the load imbalance in our previous work. To efficiently process MAC operations in parallel, we propose the load balance-aware TDC method, as shown in Fig. 9(c). In Fig. 9(b), PE0 contains nine non-zero weights, whereas PE3 contains four non-zero weights. The pipeline stage is determined by PE0, which has the highest computational complexity. However, the position of the zero-weights is always the same for each deconvolutional layer because the inverse coefficient mapping applies equally to all kernels. Therefore, we design the accelerator to perform load balancing offline, as in Fig. 8(a). The execution cycles of the deconvolutional layer are

As shown in Fig. 5, the convolutional layer created using the TDC method produces output feature maps of the same size as the input feature maps, but the number of output feature maps increases by a factor of $S^2$. The increased number of output feature maps and the input feature maps are processed in parallel by $T_m$ and $T_n$, respectively. Therefore, as opposed to the conventional DCNN accelerator, there are three different cases of performance enhancement depending on the range of $M$; these are demonstrated in Fig. 10. A visualization of the total computational complexity at the deconvolutional layer is shown in Fig. 10(a). Fig. 10(b) shows the hardware size of the DCLP. In the rectangular parallelepiped, the width and height of the bottom surface represent the size of the tiling parameters for the kernel and image, respectively. Because the DCLP performs parallel processing on the input and output feature maps, both lengths are set to 1. Fig. 10(c) shows a visualization of the difference in performance between the conventional and the proposed methods. Both methods are executed on the same DCLP. The three abovementioned cases that are dependent on $M$ are as follows.

Fig. 11. FSRCNN network structure. For simplicity, we express FSRCNN as FSRCNN(x, y, z), a combination of the sensitive variables x, y, and z. (In [25], x, y, and z are set to 56, 12, and 4, respectively.)
Case 1 ($M \le T_m / S^2$):
Our method unrolls the entire loops for the output feature maps. In addition, we alleviate the resource under-utilization problem, where idle hardware exists in the DCLP, and reduce the convolution cycles by generating LR instead of HR images. The performance enhancement is
Case 2 ($T_m / S^2 < M \le T_m$):
Our method completely solves the resource under-utilization problem by activating all $T_m - M$ hardware resources that are in the idle state. In this case, the performance enhancement is
Case 3 ($M > T_m$):
Our method cannot process more output feature maps in parallel than the existing DCNN accelerator. However, the execution speed is higher due to the reduced kernel size. The performance enhancement in this case is
IV. PROPOSED DNN-BASED SUPER-RESOLUTION SYSTEM
In this section, we propose a methodology to achieve an energy-efficient architecture for implementing the state-of-the-art DNN-based SR algorithm, FSRCNN. Fig. 11 shows the network structure of the FSRCNN. We express the convolutional layer and the deconvolutional layer as Conv($K_C$, $M$, $N$) and DeConv($K_D$, $M$, $N$), respectively. Using the TDC method, we regard the deconvolutional layer as a convolutional layer. In Fig. 11, some layer parameters are denoted by the variables x, y, and z because they are sensitive variables that determine the overall performance of FSRCNN [25]. For simplicity, we represent FSRCNN models with different sensitive variables as FSRCNN (x, y, z). In the conventional FSRCNN model, x, y, and z are experimentally set to 56, 12, and 4, respectively. PReLU [32] is used as the activation function in the FSRCNN.
A. Dataflow Optimization With On-Chip Memory
In our system, which does not use off-chip memory, the FPGA receives pixel values through the display driver in the horizontal direction of the frame. If all the convolutional layers are executed by a single CLP, the pixel data coming into the FPGA must be stored in the on-chip memory until the last layer is completely processed. Consequently, several frame buffers may be required, depending on the execution time of the CLP. In order to solve this problem, we process the input pixel data by designing all the convolutional layers to run concurrently through multiple CLPs, as in fusion architectures. In this case, multiple CLPs must be designed such that they can handle on-chip dataflow efficiently. To find the loop tiling parameters for multiple CLPs, we compare the execution cycles of each CLP with the transmission cycles of the pixel data coming from the display driver or CLP.
The computation to transmission ratio is defined as the ratio of the cycles required to execute the CLP of the $l$th convolutional layer to the cycles required to transmit the pixel data from the display driver or input buffers to the $l$th CLP. Since each CLP must process several feature maps at the same time, as many pixel data as the tiling factor are sent to the CLP. The CLP performs 2D convolutions on the feature maps with the received pixel data. If the tile sizes for processing the $l$th convolutional layer are given as in Table I, the computation to transmission ratio is calculated as
$$\text{Computation to Transmission Ratio} = \frac{\text{Execution cycles of the } l\text{th CLP}}{\text{Total number of transmission cycles}}$$
If the computation to transmission ratio is greater than 1, the number of transmission cycles in the $l$th layer is lower than the number of execution cycles in the CLP. Hence, the data transferred during the computation must be stored in a frame buffer. For example, if a UHD output image is generated using an SR algorithm with a scale factor of 2, an approximately 8.1 MB buffer memory is required to store an input image with a 1920 × 1080 resolution in the 32-bit floating-point data type. Furthermore, considering the size of the input feature maps, the required memory can exceed the available on-chip memory of a typical FPGA. For this reason, we set the computation to transmission ratio of all layers to 1 so that no frame buffer is needed.

A memory management technique is required to efficiently store the feature maps generated by multiple CLPs. Since the pixel data from the display driver are transmitted line by line, we use a line buffer that can reuse the data without being restricted by boundary conditions [18]. The line buffer is implemented as a block RAM (BRAM) in simple dual-port mode [33], in which one read and one write are allowed concurrently, so that it serves as both the input and output buffer of each CLP. The capacity required to implement the line buffer should be the same as the size of the 3D data generated by the CLP. The number of line buffers must be equal to the width of the convolution kernel. The size of the line buffers for each CLP is calculated as
FSRCNN has two 1 × 1 convolutional layers as shown in Fig. 11 . We unroll all three convolution loops to avoid frame buffering, and therefore, the outputs of the neurons become the inputs of the next neurons without the need to accumulate with the outputs of other neurons. In this manner, the output feature maps generated by the convolutional layer in front of the 1 × 1 convolutional layer can be directly sent to the CLP of the 1 × 1 convolutional layer. Thus, we connect the CLP of the layer ahead of the 1 × 1 convolutional layer and the CLP of the 1 × 1 convolutional layer without the line buffer. We call this type of processor a combined CLP.
According to the guideline in [33], an 18-Kb BRAM unit of a 7 series FPGA can store 512 32-bit words. Hence, the number of BRAMs required to generate a UHD image with FSRCNN (56, 12, 4) is 1609. Although the line buffer for the CLP of the 1 × 1 convolutional layer is removed, this requirement still exceeds the BRAM capacity available in a typical FPGA. Thus, it is necessary to reduce the usage of BRAMs for use in embedded systems.
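The BRAM count can be reproduced with a short calculation. The sketch assumes that each CLP with a $K \times K$ kernel keeps $K$ lines of $W_{in} \times N$ words, that the 1 × 1 layers need no line buffer, that FSRCNN(56, 12, 4) has the usual layer layout of [25] with the 9 × 9 deconvolution treated as a 5 × 5 convolution, and that the LR input is 1920 pixels wide. These are assumptions of the sketch, chosen because they are consistent with the statements above; the constant and function names are illustrative.

```python
from math import ceil

WORDS_PER_BRAM = 512  # 32-bit words per 18-Kb BRAM, as stated in the text
LR_WIDTH = 1920       # input width when a UHD frame is produced at scale 2

# (K, N) for every CLP that keeps a line buffer: kernel width and number of
# input feature maps. Assumed layout; the 1x1 layers are left out on purpose.
LINE_BUFFERED_LAYERS = [(5, 1)] + [(3, 12)] * 4 + [(5, 56)]


def line_buffer_words(k, n, width=LR_WIDTH):
    """K lines, each holding one row of all N input feature maps."""
    return k * width * n


def bram_count(layers):
    words = sum(line_buffer_words(k, n) for k, n in layers)
    return ceil(words / WORDS_PER_BRAM)


print(bram_count(LINE_BUFFERED_LAYERS))  # 1609
```

Under these assumptions the line buffers hold 823,680 words, which rounds up to the 1609 BRAMs quoted above; most of that total comes from the buffer in front of the transformed deconvolutional layer.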
B. Quantization
Fixed-point implementation mitigates the complexity of hardware design and potentially enables the use of embedded hardware [34]. In particular, our CLPs have three dimensions to process three convolution loops in parallel. Therefore, a fixed-point hardware implementation increases resource utilization efficiency. Fig. 12 shows the PSNR according to the data bit-width on the representative datasets Set5, Set14, and B100. Fig. 12 shows that the PSNR decreases dramatically when the bit-width is smaller than 13 bits, while the performance is maintained when the bit-width is larger than 13 bits. This is because the mantissa expressing the fractional value is sufficiently accurate even at a low bit-width [35]. To use the FPGA resources more compactly, pixels, weights, and partial sums were reduced from 32-bit floating-point to 13-bit fixed-point values using the bit-width quantization technique [35]. By quantizing the bit-width, we reduced the number of BRAMs from 1609 to 654 when implementing FSRCNN (56, 12, 4).
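As an illustration of what 13-bit fixed-point storage means, the sketch below rounds values to a signed 13-bit code with saturation. The split between integer and fractional bits (`frac_bits=8`) is a hypothetical choice, since the format used in the paper is not given in this section, and `quantize_fixed_point` is an illustrative name. The last lines check that scaling the 32-bit BRAM count by 13/32 is consistent with the reduction from 1609 to 654.

```python
from math import ceil

import numpy as np


def quantize_fixed_point(values, total_bits=13, frac_bits=8):
    """Round to signed fixed point with saturation (frac_bits is hypothetical)."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    codes = np.clip(np.round(np.asarray(values) * scale), lo, hi).astype(np.int32)
    return codes, codes / scale  # integer codes and the values they represent


codes, approx = quantize_fixed_point([0.5, -1.25, 100.0])
print(codes)   # [ 128 -320 4095], the last value saturates
print(approx)

# Storage shrinks in proportion to the word width: 32 bits -> 13 bits.
print(ceil(1609 * 13 / 32))  # 654
```

With more integer bits the representable range grows and the resolution drops, so the bit allocation would have to be chosen per signal in a real design.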
C. Compression
In Xilinx FPGAs, a DSP48E1 block [36] performs up to 25 × 18-bit multiplication. We must design a 13 × 13-bit multiplier for convolution on low bit-width data. As a result, the resources in the DSP are underutilized. To solve this problem, we use a double MAC [37], which performs two multiplications on a common operand with a single DSP. The maximum possible bit-width for each operand is 8 bits. We split the 13 × 13-bit multiplication into an 8 × 8-bit multiplication, a 13 × 5-bit multiplication, and a 5 × 8-bit multiplication and then sum the results. Because a CNN iteratively executes multiple operations on the same input feature map, the double MAC improves the efficiency of DSP usage. The resource requirement for two 13 × 13-bit multiplications is 1 DSP + 124 LUTs + 124 FFs. Because of the high logic element usage of the double MAC, we also considered designing a multiplier with a single DSP, called a single MAC. The total number of DSPs required in the design of the multiple CLPs can be obtained by
α is a parameter that determines the ratio of double MAC. A large value of α can increase image quality, but it lowers power efficiency because a large amount of resources is required [38] . Thus, we experimentally set α to 0.7 considering this trade-off. L is the total number of layers. The number of DSPs required to implement FSRCNN (56, 12, 4) is 8,102 when α = 0.7. This requirement is higher than the total number of DSPs embedded in high-end FPGAs. For this reason, we compress the FSRCNN into an optimal model for efficient inference.
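The split of a 13 × 13-bit multiplication into 8 × 8-bit, 13 × 5-bit, and 5 × 8-bit partial products, described earlier in this subsection, can be checked in software. The sketch below covers unsigned operands only and makes no claim about how the partial products are placed on the DSP and the LUTs in the double MAC of [37]; `split_multiply_13x13` is an illustrative name.

```python
def split_multiply_13x13(a, b):
    """Unsigned 13x13-bit product from 8x8, 13x5, and 5x8 partial products."""
    if not (0 <= a < 1 << 13 and 0 <= b < 1 << 13):
        raise ValueError("operands must fit in 13 unsigned bits")
    a_lo, a_hi = a & 0xFF, a >> 8  # low 8 bits and high 5 bits of a
    b_lo, b_hi = b & 0xFF, b >> 8  # low 8 bits and high 5 bits of b
    p_8x8 = a_lo * b_lo            # 8 x 8-bit partial product
    p_13x5 = a * b_hi              # 13 x 5-bit partial product
    p_5x8 = a_hi * b_lo            # 5 x 8-bit partial product
    # a * b = a_lo * b_lo + ((a * b_hi + a_hi * b_lo) << 8)
    return p_8x8 + ((p_13x5 + p_5x8) << 8)


print(split_multiply_13x13(8191, 8191) == 8191 * 8191)  # True
```

The identity follows from writing $b = 2^8 b_{hi} + b_{lo}$ and then $a = 2^8 a_{hi} + a_{lo}$ in the term $a \cdot b_{lo}$; signed operands would need sign extension, which is omitted here.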
We change the sensitive variables of FSRCNN considering the total number of DSPs in the target FPGA specification. We set the target FPGA to the Kintex-7 410T FPGA, which is a cost-effective digital processing platform. Since the number of parameters used in the deconvolutional layer is approximately 50% of the total due to large kernel size, we reduce the kernel size of the deconvolutional layer from 9 × 9 to 7 × 7 so that sensitive parameters can take larger values. Table IV shows the average PSNR on the Set5 dataset and resource usage for various sensitive variables when the scale factor is 2. We train the models offline via the GPU with the Caffe framework [39] . Due to a lack of hardware resources, sensitive variables cannot be large values. When the number of parameters is fixed, we see that performance is better when z is smaller. This trend also appears in the convergence curves shown in Fig. 13 . This is because if z is large, x is too small for the convolutional layer to extract enough local features for reconstruction.
As shown in Table IV, the model with the highest PSNR is FSRCNN (25, 5, 1). This model is also resource-efficient because it has the lowest BRAM usage. As a result, we implement FSRCNN (25, 5, 1), which is a light version of FSRCNN, in hardware.
D. Hardware Implementation
Fig. 14(a) shows the proposed on-chip memory-based FPGA architecture. The FPGA receives an input LR image with RGB channels and uses the Y channel after performing RGB-to-YCbCr conversion. In general, in DNN-based SR systems, the Cb and Cr channels are rarely used for learning [25]. Thus, we up-scale the Cb and Cr channels with bicubic interpolation.
Before entering the CLP, the pixel data are stored in the line buffers. After $K^{l}_{C} - 1$ lines are stored, the outputs of the line buffers and the incoming data enter the CLP; then convolution is performed with the filters from the weight buffer. Fig. 14(a) shows that the CLPs of the first layer and the second layer are fused into the combined CLP1. In addition, the CLPs of the third layer and the fourth layer are fused into the combined CLP2. By using the combined CLPs, we could reduce the total amount of line buffers to 83%.
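The line-buffer behavior described above can be sketched in software as follows. This is a simplified model under stated assumptions, not the RTL design: it handles a single channel, applies no padding, and uses a hypothetical generator named `stream_conv_rows`. It shows only that, once K − 1 rows are buffered, every incoming row completes a K-row window and yields one output row.

```python
from collections import deque


def stream_conv_rows(rows, kernel):
    """Yield output rows of an unpadded 2-D convolution as input rows stream in.

    rows:   iterable of equal-length lists (one image row at a time)
    kernel: K x K list of lists of weights
    """
    k = len(kernel)
    line_buffers = deque(maxlen=k - 1)  # keeps the K-1 most recent rows
    for row in rows:
        if len(line_buffers) == k - 1:
            # Buffered rows plus the incoming row form a K-row window.
            window = list(line_buffers) + [row]
            out = []
            for x in range(len(row) - k + 1):
                acc = 0
                for i in range(k):
                    for j in range(k):
                        acc += window[i][x + j] * kernel[i][j]
                out.append(acc)
            yield out
        # The oldest row is dropped automatically once the deque is full.
        line_buffers.append(row)
```

With a 3 × 3 kernel, the first two rows only fill the buffers; the first output row appears when the third input row arrives.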
Figs. 14(b)-(e) show the computation engines in the PE constituting the CLP of the third layer. First, the PE fetches the data stored in the line buffer into the registers and performs multiplication with weights. We perform all the multiplication operations within the kernel at the same time and process these operations $M_l \times N_l$ times in parallel, as depicted in Fig. 14(b). Then, we add the outputs of the multiplication through the adder tree, as shown in Fig. 14(c). Fig. 14(d) shows the process of adding the results of each input feature map through the add engine consisting of $M_l$ adder trees. Finally, Fig. 14(e) shows the output of the neurons via the PReLU activation engine.
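The four stages of the PE described above (multiply, adder tree, accumulation across input feature maps, PReLU) can be sketched for a single output neuron as follows. The function names, the explicit bias term, and the scalar PReLU slope are assumptions made for illustration; the hardware evaluates many such neurons in parallel, while this sketch is sequential.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, the way a hardware adder tree does."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def prelu(x, slope):
    """Parametric ReLU: identity for x >= 0, scaled by a learned slope otherwise."""
    return x if x >= 0 else slope * x


def pe_output(windows, kernels, bias, slope):
    """Compute one output neuron.

    windows[n]: flattened K*K input window taken from input feature map n
    kernels[n]: flattened K*K weights applied to input feature map n
    """
    per_map = []
    for window, kernel in zip(windows, kernels):
        products = [p * w for p, w in zip(window, kernel)]  # multiply stage
        per_map.append(adder_tree(products))                # adder tree per kernel
    total = adder_tree(per_map) + bias                      # add across input maps
    return prelu(total, slope)                              # activation stage
```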
V. EXPERIMENTAL RESULTS
A. Evaluation of the Proposed DCNN Accelerator
We validated the DCNN accelerators on the Xilinx Virtex-7 485T FPGA in the same experimental environment as that used in [28]. We implemented the proposed and the conventional architectures using Vivado HLS 2016.4 with single-precision floating-point arithmetic. We evaluated the DCNN models FSRCNN and DCGAN [40] on the hardware used in previous studies [27], [28]. To compare the proposed DCNN accelerator with the conventional DCNN accelerator under the same conditions, we designed the accelerator using the single CLP method [10]. The conventional DCNN accelerator parallelized the convolution loops over output feature maps and input feature maps with $T_m$ and $T_n$, respectively, and determined the optimal tiling parameters through the roofline model [41]. Fig. 15 shows the possible design-space solutions when designing the CLP for the fourth layer of the FSRCNN by means of the roofline model. The computation-to-communication ratio is the number of operations performed per external memory access. Therefore, in order to utilize all available hardware resources and minimize the off-chip memory bandwidth, we chose the optimal solution for each layer, as depicted in Fig. 15. Then, we performed cross-layer optimization [28]. We set the tiling parameters ($T_m$, $T_n$) for the FSRCNN and DCGAN to (56, 9) and (4, 128), respectively. Table V compares the performance of the existing method and the proposed method. The performance analysis of the proposed accelerator for DCGAN and FSRCNN is as follows.
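Before turning to the per-model analysis, the roofline-based selection described above can be sketched as follows. This is a minimal illustration of the general roofline idea, not the authors' design-space exploration: the candidate fields, the units (operations per byte for the computation-to-communication ratio, GB/s for bandwidth), and all numeric values are hypothetical.

```python
def attainable_performance(ctc_ratio, bandwidth_gbs, compute_roof_gops):
    """Roofline bound: performance is capped by compute or by memory traffic.

    ctc_ratio is assumed to be in operations per byte, so that
    ctc_ratio * bandwidth_gbs is in giga-operations per second.
    """
    return min(compute_roof_gops, ctc_ratio * bandwidth_gbs)


def pick_design(candidates, bandwidth_gbs):
    """Pick the tiling with the best attainable performance.

    Ties are broken in favor of the higher computation-to-communication
    ratio, because that design needs less off-chip bandwidth.
    """
    return max(
        candidates,
        key=lambda c: (
            attainable_performance(c["ctc"], bandwidth_gbs, c["compute_roof"]),
            c["ctc"],
        ),
    )


# Hypothetical design points for one layer (values are illustrative only).
candidates = [
    {"tm": 8, "tn": 8, "compute_roof": 100, "ctc": 10},
    {"tm": 16, "tn": 4, "compute_roof": 100, "ctc": 50},
    {"tm": 4, "tn": 16, "compute_roof": 100, "ctc": 30},
    {"tm": 32, "tn": 4, "compute_roof": 120, "ctc": 5},
]
best = pick_design(candidates, bandwidth_gbs=4)
```

In this toy example, the design with the highest compute roof loses because it is bandwidth bound, which is the trade-off that the roofline plot makes visible.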
First, DCGAN consists of four deconvolutional layers. Each layer has a greater number of input feature maps than output feature maps. Thus, $T_m$ was set to 4, which is 32 times smaller than $T_n$. Our load balance-aware TDC method improved performance even further when the resource underutilization problem existed in the conventional accelerators because $M$ was smaller than $T_m$. In DCGAN, this situation occurred in the last deconvolutional layer. However, the speedup was not significant, because there was little difference between $T_m$ and $M$. Since the resource underutilization problem was not apparent in the CLP of the DCGAN, the performance was improved only by the advantage of performing kernel computation in shorter cycles. As a result, the proposed method was 3.59 times faster than the conventional method in the DCGAN.
Second, FSRCNN uses the deconvolutional layer as the last layer and can set the resolution of the output image according to $S$. The kernel width $K_D$ is 9, as shown in Table V. Unlike in DCGAN, 88.9% of the hardware resources were idle in the deconvolutional layer of FSRCNN. This is because the number of output feature maps $M$ was nine times smaller than $T_m$. Our load balance-aware TDC method reduced the ratio of idle hardware to 55.5% when the value of $S$ was 2. When the value of $S$ was 3, all the idle hardware was activated, resulting in a performance improvement of 81 times compared to the conventional method. However, as shown in Table III, there were no zero-valued weights in the filters, and therefore we could not take advantage of sparse matrix multiplication in this case. However, Table III shows that the ratio of zero- In other words, the load imbalance was serious, because the difference in the activated resources of the PEs was large. As a result, we evenly distributed the operations that the PEs executed unequally in conventional kernel computation. Therefore, our accelerator was able to run 108 times faster than the conventional accelerator using the same hardware resources.
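The idle-resource percentages quoted above can be reproduced with a short calculation. The sketch assumes M = 1 output feature map and nine parallel output lanes, inferred from the statement that M was nine times smaller than the tiling parameter, and it models the TDC method as exposing S² output channels to those lanes. The function name `idle_fraction` is hypothetical.

```python
def idle_fraction(m, t_m, s=1, use_tdc=False):
    """Fraction of the t_m parallel output lanes left without work.

    m:       number of output feature maps of the layer
    t_m:     number of output feature maps processed in parallel
    s:       SR scale factor
    use_tdc: if True, assume the layer exposes m * s^2 output channels
    """
    active = m * s * s if use_tdc else m
    active = min(active, t_m)  # cannot keep more lanes busy than exist
    return 1 - active / t_m


conventional = idle_fraction(1, 9)                 # 8/9, about 88.9%
tdc_s2 = idle_fraction(1, 9, s=2, use_tdc=True)    # 5/9, about 55.6%
tdc_s3 = idle_fraction(1, 9, s=3, use_tdc=True)    # 0.0, all lanes busy
```

Under these assumptions the model agrees with the reported 88.9% and with full utilization at a scale factor of 3; for a scale factor of 2 it gives 5/9, which the text reports as 55.5%.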
B. Implementation Results of Proposed SR System
We evaluated our proposed DNN-based SR system using Vivado 2016.4. We designed our overall architecture with Verilog RTL. Our FPGA was connected to a 2880 × 1280 (QHD) panel. size to maximize the utilization of the DSP in the Kintex-7 410T FPGA, the DSPs were fully utilized. Additionally, BRAM usage was 26% of the total available, with the advantage that the CNN model was smaller. Fig. 16 shows the hardware resource breakdown of the Light FSRCNN. Among all the modules, the DeConv1 module has the highest resource usage. We used the Xilinx Power Estimation and Analysis Tools to estimate the power consumption of the FPGA board. The total thermal power was 5.38 W, consisting of contributions from combined CLP1 (20.7%), combined CLP2 (14.3%), DeConv1 (35.3%), I/O (8.2%), controller (4.5%), interface (11.47%), and others (5.4%). DNN modules that used many DSPs and logic cells had higher power consumption than other sources.
In conventional accelerators, the architecture is designed for either CNNs or DCNNs. In contrast, our accelerator is the first with a hybrid form in which both CNNs and DCNNs are implemented together on the hardware platform. The throughput (GOPS) of each implemented accelerator, shown in Table VI, was computed as the total computational complexity for spatial convolution divided by the average execution time per image. DCNNs increase the computational complexity in proportion to the power of $S$. For this reason, the share of computational complexity occupied by the deconvolutional layer in Light FSRCNN was 81.67%, 90.92%, and 94.68% when the value of $S$ was 2, 3, and 4, respectively. However, our DCNN accelerator solved the large loop dimension problem of the output image by using the TDC method. Because the TDC method let us generate HR images simultaneously with $S^2$ channels of LR images, GOPS was higher in proportion to $S$.
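The throughput definition given above, together with the energy-efficiency metric (GOPS/W) used to compare accelerators, can be written as a few lines of arithmetic. The function names and the example figures below are hypothetical and are not measurements from the paper.

```python
def throughput_gops(ops_per_image, seconds_per_image):
    """Throughput in GOPS: operations per image divided by time per image."""
    return ops_per_image / seconds_per_image / 1e9


def energy_efficiency(gops, watts):
    """Energy efficiency in GOPS/W."""
    return gops / watts


def layer_share(layer_ops, total_ops):
    """Fraction of the total computational complexity taken by one layer."""
    return layer_ops / total_ops


# Illustrative numbers only: 3 giga-operations per image, 0.5 s per image, 2 W.
gops = throughput_gops(3e9, 0.5)
efficiency = energy_efficiency(gops, 2.0)
```

Because the operation count grows with the scale factor while the TDC method keeps the execution time per LR input roughly constant, a calculation of this form is where the increase in GOPS with S would show up.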
Compared to the other accelerators for object detection and recognition, the OpenCL-based method [21] had the highest throughput, using the highest clock rate. However, when power consumption was considered, our accelerator had the highest power efficiency. Our system consumed less power by not using off-chip memory and achieved high performance in Light FSRCNN. Table VI shows the total execution cycles required to run each DNN. Since the computational complexity of AlexNet is approximately 20 times smaller than that of VGG16 [5], the AlexNet-based accelerator [12] required the fewest cycles. Light FSRCNN has 1.2 times less computational complexity than VGG16. However, our accelerator was at least 3.6 times faster than the other accelerators based on VGG16. Zhang and Prasanna [20] demonstrated that convolution can be performed in the frequency domain through the fast Fourier transform. Although their method had the advantage of lower hardware resource usage relative to performance, it required high power because it included a CPU together with off-chip memory. Another example, the state-of-the-art fusion architecture [18], reduced the amount of off-chip data transfer more than the conventional fusion architecture [17] by optimizing the dataflow between adjacent layers using more BRAMs. As a result, the amount of external memory access was reduced, which improved power efficiency. However, a drawback remained in that power was still required for the off-chip memory. Without using off-chip memory, we enhanced power efficiency with an optimized dataflow for on-chip memory. Since we had to move large amounts of intermediate data to BRAMs, our FF usage was higher than that of other fusion architectures. However, our CNN accelerator was at least three times more power efficient than other accelerators. Table VII shows the hardware implementation results of our proposed system as compared to those of existing SR systems.
Yang et al. [42] implemented anchored neighborhood regression (ANR) in hardware to generate FHD images at 60 fps, and Kim et al. [43] generated UHD images at 60 fps using the super-interpolation (SI) method. However, the scale factor supported by their methods was fixed at 2, and therefore they could not generate an output image with a larger resolution on the same hardware architecture. Our DNN-based SR system required more hardware resources than conventional methods but can support a variety of scale factors through the deconvolutional layer with the same hardware resources. A virtual input/output (VIO) core [44] is a customizable core that allows virtual inputs and outputs to be added to a hardware description language design. This core allowed us to drive internal FPGA signals synchronously or asynchronously. The FSRCNN is characterized by the fact that the weights of the convolutional layers do not change even if the scale factor changes; only the weights of the deconvolutional layer change [25]. We pre-stored the weights of all the deconvolutional layers, each 3.98 KB in size, in ROM using the VIO core. Thus, when a different scale factor was required, we could obtain the output by adjusting the internal signals through the VIO core, without re-synthesizing the design to change the weights of the deconvolutional layer stored in the ROM.
Table VII also shows that our system could generate QHD images at 141 fps when the scale factor is 2. In the case where a UHD video stream was required, our system could generate UHD images at 62.7 fps using approximately twice the number of BRAMs. When the scale factor was greater than 2, the speed of our system was inversely proportional to the input resolution. For example, if the scale factor was 3, an image with a resolution of 1280 × 720, which is smaller than FHD, was used as the input to generate the UHD image. Table VIII compares the image quality of various SR methods at different scale factors. We evaluated the performance of the SR systems and algorithms on the most frequently used datasets. Although our system had lower performance than the FSRCNN for hardware implementation, our method could achieve higher image quality than existing systems. Fig. 17 shows the reconstructed images of CNN-based SR algorithms and bicubic interpolation. Our Light FSRCNN used 5,707 fewer parameters than SRCNN, but its output images had perceptual quality similar to those of other algorithms. Fig. 18 demonstrates our DNN-based SR system on a mobile panel. We confirmed that HR images can be generated from the QHD panel for mobile applications. In the future, we will further increase the sparsity of the DNN to implement the DNN-based SR as a more hardware-friendly architecture. In addition, we will improve the resource efficiency of our DCNN accelerator by transforming the feature maps into the frequency domain.
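The relation between scale factor and input resolution mentioned in this section (a 1280 × 720 input for a UHD output at a scale factor of 3) follows from dividing the target resolution by the scale factor. A minimal sketch, with a hypothetical helper name and assuming that UHD means 3840 × 2160:

```python
def lr_input_size(hr_width, hr_height, scale):
    """Return the LR input resolution needed to reach an HR target at a given scale."""
    if hr_width % scale or hr_height % scale:
        raise ValueError("HR resolution must be divisible by the scale factor")
    return hr_width // scale, hr_height // scale


# Assumed UHD target of 3840 x 2160 at the three supported scale factors.
uhd_inputs = {s: lr_input_size(3840, 2160, s) for s in (2, 3, 4)}
```

Since the network runs at LR resolution, the number of input pixels to process falls as the scale factor grows, which is consistent with the statement that speed is inversely proportional to the input resolution.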
VI. CONCLUSION
In this paper, we proposed an energy-efficient DNN-based SR architecture for hardware implementation. First, we presented a novel methodology to optimize the dataflow for effectively implementing in hardware the DCNN, which has higher computational complexity than the CNN. In addition, we proposed an energy-efficient DNN architecture. Our experimental results showed that our DCNN accelerator ran up to 108 times faster than a conventional DCNN accelerator with the same hardware resources. Moreover, the proposed DNN-based SR system was shown to be at least three times more power efficient than the state-of-the-art implementations.
