As a type of prominent studies in deep learning, generative models have been widely investigated in research recently. Two research branches of the deep learning models, the Generative Networks (GANs, VAE) and the Semantic Segmentation, rely highly on the upsampling operations, especially the transposed convolution and the dilated convolution. However, these two types of convolutions are intrinsically different from standard convolution regarding the insertion of zeros in input feature maps or in kernels respectively. This distinct nature severely degrades the performance of the existing deep learning engine or frameworks, such as Darknet, Tensorflow, and PyTorch, which are mainly developed for the standard convolution. Another trend in deep learning realm is to deploy the model onto edge/ embedded devices, in which the memory resource is scarce. In this work, we propose a Highly Untangled Generative-model Engine for Edge-computing or HUGE 2 for accelerating these two special convolutions on the edge-computing platform by decomposing the kernels and untangling these smaller convolutions by performing basic matrix multiplications. The methods we propose use much smaller memory footprint, hence much fewer memory accesses, and the data access patterns also dramatically increase the reusability of the data already fetched in caches, hence increasing the localities of caches. Our engine achieves a speedup of nearly 5× on embedded CPUs, and around 10× on embedded GPUs, and more than 50% reduction of memory access.
Introduction
Recently, the deep generative models and semantic segmentation algorithms have shown their stunning abilities in various fields, such as creating realistic images from the learned distribution of a given dataset, providing the robots with the ability to learn from environment without human input, generating the synthetic 3D objects for the scene parsing in a scenario, and so forth. These creative deep learning models attract great interests in research by both scholar and industry. The representative works include the Generative Neural Networks (GAN) Goodfellow et al. [2014] , the Variational Auto-encoder (VAE) Kingma and Welling [2013] , and the semantic image segmentation algorithms Shelhamer et al. [2017] , Chen et al. [2017] .
However, the generative models and semantic segmentation algorithms rely heavily on the deconvolution which is an inefficient, and both computation-and memory-intensive operation. The inefficiency comes either from the zero insertions in either input tensor or kernels or from repeatedly accesses to the overlapped regions. Zero insertions cause wasteful computations, hence high latency. The nonconsecutive memory access manner in deconvolutions also hurts system performance drastically. The overlapped region in outputs hinders the concurrent processing because the chained memory-writings happen to the same location.
In this work, we conceive a set of solutions from an algorithmic perspective to improve the performance of deconvolution for embedded systems. And the experiments show that we achieve the speedup nearly 10× on GPUs and 5× on CPU, the memory storage and their accesses are reduced by more than 50 percent.
Background

Deconvolution
The so-called deconvolution used in the deconvolution layers of the generative adversarial networks and the semantic image segmentation is actually not as exact as the reverse operation of the convolution. Actually, the deconvolution layers are learnable up-sampling layers. Two categories of special convolution operations can fulfill such kind of task, they are the transposed convolution and the dilated convolution, respectively. The following subsections give details about how these operations work.
Transposed Convolution
Transposed convolution, also called Fractionally-Strided Convolution, is used not only to upsample an initial layer but also to create new features in enlarged output feature maps. Theoretically, transposed convolution works as a process of swapping the Forward and backward passes of a convolution, and this is where its name comes from. Algorithm 1 describes how this kind of convolution works. As it shows, when s m and s n are bigger than 1, the kernels slide on the feature maps with fractional steps. Figure 1 shows the implementations of the transposed convolution and its counterpart. It is always possible to emulate a transposed convolution with a direct convolution. Such process first spreads the input feature map by inserting zeros (or blank lines) between each pair of rows and columns. The original input tensor I now becomesÎ. It then applies a standard convolution, with strides of 1 (equivalent to sliding step of 1 stride on the original input Dumoulin and Visin [2016] ), on the resulting input representation, as shown in Algorithm 1. Let us take left hand side of Figure 2 as an example. It illustrates a 2D transposed convolution of a zero-inserted feature map of size 6 × 6 and a 3 × 3 transposed kernel.
Dilated Convolution
Strictly speaking, despite of that dilated convolution, also known as atrous convolution, is not acknowledged as a kind of deconvolution, it has been widely explored to upsample input tensors in the semantic image segmentation algorithms. Moreover, it shares some characteristics on which we can apply our acceleration algorithm as it does for the transposed convolution. On the contrary to transposed convolution, dilated convolution inserts zeros into kernels but not input tensors. The kernels are dilated so as to enlarge their corresponding receptive fields. One thing needs to be noticed is that only stride bigger than 1 has the effect of upsampling on input tensors. Details are demonstrated in Algorithm 2 and right side of Figure 2.
Previous Work
To our best knowledge, up to this work, most of optimized solutions have been proposed are from the research of the hardware accelerator realm. These designated designs achieve much higher throughput compared with non-optimized generic hardware. Hence, our goal is to conceive an easily-accessible and cost-efficient solution for the generic hardware.
1. Zero-Skipping: Yazdanbakhsh et al. [2018a,b] present a set of designs by swapping zero rows and columns with non-zeros ones, and then rearranging non-zero rows and columns into effective working groups. However, this design doesn't thoroughly solve the unbalanced working load problem among effective computation groups. Song et al. [2018] discovers the delicate mathematical relation of indices among input tensors, kernels, and output tensors. These relations help rearrange the computations to skip zeros. But this method lacks memory access coalescing. Therefore, input tensors and kernels are accessed in a non-consecutive fashion with degradation in the overall performance of the system.
Reverse Looping and Overlapping:
Reverse Looping is introduced by both Zhang et al.
[2017] and Xu et al. [2018] . This technique avoids accessing the output tensors in an overlapped manner with more operations, especially the accumulations and memory writings. Reverse looping, on the contrary, uses the output space to determine corresponding input blocks, and thus eliminating the need for the additional accumulations and memory accesses. However, the overlapped regions are not evenly distributed, hence the work load unbalancing issue among processing elements is still not well solved by such kind of solution.
Figure 2: left: the transposed convolution of a 3 × 3 kernel over a 6 × 6 input padded with a 1 × 1 border of zeros using 2 × 2 strides; right: dilated convolution of a 3 × 3 kernel over a 7 × 7 input with a kernel of the dilation factor of 2 Dumoulin and Visin [2016] .
Algorithm
In this section, we introduce our algorithm for accelerating the deconvolution operations. Our algorithm consists of three steps: 1) kernel decomposition, 2) untangling of kernels and matrix multiplications, and 3) dispatching and combining the results to the output tensor.
The following subsections provide the explanation for each of them. Figure 3: the kernel of the transpose convolution is decomposed into 4 patterns, each convolves with zero-inserted feature maps with stride 2, the final result is obtained by combining the 4 partial feature maps; the yellow, pink, and blue patches correspond to sliding windows at different positions.
Decompose Deconvolution
Given an input tensor with stride 2 zero-inserted as example, and let us take a transposed kernel to slide on it. We discover that there exists 4 kinds of patterns as shown at the bottom of Figure 3 where the nonzero elements in the kernel meet the nonzero elements in the zero-inserted input tensorÎ, and thus generate non-overlapped effective outputs as shown on the top of Figure 3 . Mathematical description of these 4 patterns is given below:
Pattern 1: odd columns and odd rows of kernel convolve with stride 2 on input tensorÎ and generate even columns and even rows of output tensor O.
Pattern 2: even columns and odd rows of kernel convolve with input tensorÎ and generate odd columns and even rows of output tensor O.
Pattern 3: odd columns and even rows of kernel convolve with input tensorÎ and generate even columns and odd rows of output tensor.
Pattern 4: even columns and even rows of kernel convolve with input tensorÎ and generate odd columns and odd rows of output tensor. One benefit of such decomposition is that the non-overlapped sparse regions on the output tensor do not cause any race conditions for writing in memory, since it will not be blocked by consecutive memory writings.
After having investigated the relation between indices of the nonzeros in input tensor and the decomposed kernels, we draw the conclusion that we can safely remove all the zero inserted in both input tensor and the decomposed kernels as shown in left side of Figure 4 . The flow demonstrates the zero-removal for all patterns where they become 4 smaller standard convolutions on input tensors without zero-insertion. Then we scatter and combine their results. The scattering of the results from each pattern follows the corresponding indices used in the zero-inserted version.
And the simplified flow on the right side of Figure 4 resembles the Inception module of GoogLeNet Szegedy et al. [2015] except that the last step here is scattering and combination instead of stacking.
Untangling
To further improve the parallelism of arithmetic computations, we propose an algorithm to untangle every decomposed pattern of the transposed convolution into a set of 1 × 1 convolutions.
Untangle standard convolution
As shown in the right side of Figure 4 each pattern convolves with input tensor as a standard convolution. To better understand our algorithm, let us take pattern 4 (the 3 × 3 one) as an example. The process is shown in Figure 5 . Given N decomposed kernels of pattern 4 (with zero removed) from previous subsection and the original input tensor, each kernel has dimension of m × n × C and the input tensor has dimension H × W × C. We regroup the elements of the kernels by gathering N columns along the dimension C from every kernel at position (x, y) (e.g. (0, 0) at top left case, (0, 2) for the bottom right case in the Figure 5 ) of the (m, n) plane. These columns form a matrix of dimension N × C. Then, their corresponding receptive fields on the input tensor can be fetched for input tensor to form another matrix of dimension (H − m + 1)(W − n + 1) × C. Such configuration can be regarded as a 1 × 1 convolution with N 1 × 1 kernels working with a cropped tensor. The products of the m × n matrix multiplications are then accumulated together. The elements of the resultant matrix are then dispatched to the corresponding position in the output tensor.
Untangle dilated convolution
Dilated convolution can also take the advantage of untangling. As it shows in left side of Figure 6 untangling technique is also applicable. The sliding step on input tensor is larger, and the receptive field shrink with multiple of stride. 
Training of GAN
The back-propagation of the discriminator of GAN can be seen as a special case of dilated convolution.
Step 3 of right side of Figure 6 depicts that, in order to propagate the derivatives of the output errors, there are C × N convolutions by C input feature maps and N derivative maps. Each derivative map is dilated (since discriminator uses the strided convolution) and convolves with each input feature map.
Therefore, we can make C copies of each of N derivative maps from output errors to form N new dilated kernels of C channels. Then the dilated kernels convolve with input tensor to form the derivates of kernels and the results are subtracted from corresponding kernels.
The dilated convolution in step 4 of right side of Figure 6 is actually a depth-wise version. Hence, it corresponds to C = 1 in left side of Figure 6 which is seen as a outer-product of two vectors.
The back-propagation of generator of GAN can be seen as a strided convolution of derivative maps of output errors and input tensor (not shown in this paper).
Experimental Results
This section provides the evaluation of our algorithms. We use the deconvolution layers of DCGAN Radford et al. [2015] and cGAN Mirza and Osindero [2014] as case study. Their configurations are shown in Table 1 . In this paper, we mainly focus on the inference phase of deconvolution layers, and all models are pretrained with CIFAR100 Krizhevsky et al. dataset. The experiments for GAN's training are only investigated at several typical layers.
The baseline of library we pick up is DarkNet Redmon [2013 Redmon [ -2016 since it is open-sourced, and the commercial library such as cuDNN Chetlur et al. [2014] only delivers with binary code. The system used in our experiments equips with embedded CPU, 4-core ARM Cortex-A57, and a Nvidia's GPU (256-core NVIDIA Pascal TM Embedded GPU). The experiments are run on both embedded CPU and embedded GPU. The metrics used for performance comparison includes the speedup and memory access reduction. We compared our implementations of transposed convolution and dilated convolution with the baseline, which is the naive implementations from DarkNet for both CPU and GPU. Most 2D standard and transpose convolution implementation in modern deep learning library are based on im2col.
Speedup of Computation
The speedup is obtained by the comparison of the computational runtime with the baseline. Figure 7 demonstrates the speedup gained by applying kernel decomposition and untangling for DCGAN and cGAN, respectively. From the results, we can see that the shallower deconvolution layer are more compute-bounded since they have more kernels which deconvolve with input tensor and require much more computational operations. Untangling transposed kernels can efficiently improve the parallelism by taking advantage of larger C and N .
Reduction of Memory Access
The results of experiments for the memory access reduction by decomposition and untangling are provided in Figure 8 .
One more thing we want to mention is that untangling technique we applied favors the C ×N ×R ×S memory layout for the transposed kernels and C × H × W for the input tensor. This is because elements along C and N dimensions are stored consecutively in these layouts, and this helps with the data fetching in coalescing memory access pattern.
As shown in Figure 8 , it is obvious that the deeper deconvolution layers are data-bounded, the reduction can be obtained more on the deeper layers since the output tensor becomes larger by the unsampling effects. We achieve a memory access reduction around 30% to 70% by only applying untangling technique. 
Speedup in GAN training
The right side of Figure 8 plots the speedup of training of GANs. We select several typical layers for the experiments, we want to cover both the cases for dilated derivative maps convolving input tensor and derivative maps stridedly convolving input tensors.
Conclusion
In this paper we presented a set of efficient algorithms and optimizations for deconvolutions, these algorithms are the core components in our deep generative model engine "HUGE". We devised them as pervasive as possible to fit on most hardware platforms. HUGE really accomplishes the outstanding results for our applications. It shows great improvements in two crucial aspects, computation loads and memory access, respectively.
