Abstract: Deep neural networks have achieved remarkable progress during the past few years and are currently fundamental tools of many intelligent systems. At the same time, the computational complexity and resource consumption of these networks keep increasing. This poses a significant challenge to the deployment of such networks, especially for real-time applications or on resource-limited devices. Thus, network acceleration has become a hot topic within the deep learning community. On the hardware side, a number of FPGA/ASIC-based accelerators for deep neural networks have been proposed in recent years. In this paper, we provide a comprehensive survey of recent advances in network acceleration, compression and accelerator design from both the algorithm and hardware sides. Specifically, we provide a thorough analysis of each of the following topics: network pruning, low-rank approximation, network quantization, teacher-student networks, compact network design and hardware accelerators. Finally, we discuss a few possible future directions.
Introduction
In recent years, Deep Neural Networks (DNNs) have achieved remarkable performance in a wide range of applications, including but not limited to computer vision, natural language processing and speech recognition. These breakthroughs are closely related to the increased amount of training data and more powerful computing resources. For example, one breakthrough in natural image recognition was achieved by AlexNet (Krizhevsky et al., 2012), which was trained using multiple graphics processing units (GPUs) on about 1.2M images. Since then, the performance of DNNs has been continuously improving. For many tasks, DNNs are reported to be able to outperform humans. The problem, however, is that the computational complexity as well as the storage requirements of these DNNs are also increasing drastically, as shown in Table 1. Specifically, the widely used VGG-16 model (Simonyan and Zisserman, 2014) involves more than 500MB of storage and over 15B FLOPs to classify a single 224 × 224 image.
‡ Corresponding author. © Zhejiang University and Springer-Verlag GmbH Germany 2017
Thanks to recent powerful GPUs and CPU clusters equipped with more memory resources and computational units, these more powerful DNNs can be trained within a relatively affordable time. However, at inference time, the long execution time of such models is impractical for real-time applications. Recent years have witnessed great progress in embedded and mobile devices such as unmanned drones, smart phones and intelligent glasses, and the demand for deploying DNN models on these devices is growing. However, the resources of these devices, for example, the storage, the computational units and the battery power, are very limited, which poses a great challenge for accelerating modern DNNs in low-cost settings.
Therefore, how to equip specific hardware with efficient deep networks without significantly lowering the performance has become a critical problem. To deal with this issue, many great ideas and methods from the algorithm side have been investigated during the past few years. Some of these works focus on model compression, while others focus on acceleration or on lowering the power consumption. On the hardware side, a great variety of FPGA/ASIC-based accelerators have been proposed for embedded and mobile applications. In this paper, we give a comprehensive survey of several advanced approaches in network compression, acceleration and accelerator design. We present the central ideas behind each approach and point out the similarities and differences between the methods. Finally, we give some future directions in this area.
arXiv:1802.00939v1 [cs.CV] 3 Feb 2018
The remaining part of this paper is organised as follows. In Section 2, we give the background of network acceleration and compression. From Section 3 to Section 7, we systematically describe a series of hardware-efficient DNN algorithms, including network pruning, low-rank approximation, network quantization, teacher-student networks and compact network design. In Section 8, we introduce the design and implementation of hardware accelerators based on FPGA/ASIC. In Section 9, we discuss the surveyed methods and present some future directions, and Section 10 concludes this paper.
Background
Recently, deep convolutional neural networks (CNNs) have become quite popular due to their powerful representation capacity. With the huge success of CNNs, the demand for deploying deep networks in real-world applications is increasing. However, the huge parameter size and computational complexity are the two key problems for network deployment. For the training phase of CNNs, the computational complexity is not a critical problem thanks to high-performance GPUs or CPU clusters. The parameter size also has little effect on the training phase, since modern computers have very large disk and memory storage. But things are quite different for the inference phase of CNNs, especially on embedded and mobile devices.
The huge computational complexity brings two problems for deploying CNNs in real-world applications. One is that the inference phase of CNNs becomes slow when the computational complexity is large, which makes it difficult to deploy CNNs in real-time applications. The other is that the dense computation of CNNs consumes a lot of battery power, which is limited on mobile devices.
The large parameter size of CNNs consumes a lot of storage and run-time memory, both of which are quite limited on embedded devices. In addition, the large parameter size makes it difficult to download new models online on mobile devices.
To solve these problems, network compression and acceleration methods have been proposed. In general, the computational complexity of CNNs is dominated by the convolutional layers, while the number of parameters is mainly related to the fully-connected layers, as shown in Table 1. Thus most network acceleration methods focus on decreasing the computational complexity of the convolutional layers, while network compression methods mainly try to compress the fully-connected layers.
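The contrast between convolutional and fully-connected layers can be checked with a small counting sketch. The layer shapes below are illustrative VGG-16-style assumptions rather than values read from Table 1, and biases are ignored.

```python
def conv_cost(w, h, c, n, out_w, out_h):
    """Return (parameters, multiply-adds) of a w x h convolution with
    c input channels, n output channels and an out_w x out_h output map."""
    params = w * h * c * n
    macs = params * out_w * out_h  # every weight is reused at each output position
    return params, macs


def fc_cost(n_in, n_out):
    """Return (parameters, multiply-adds) of a fully-connected layer."""
    params = n_in * n_out
    return params, params  # each weight is used exactly once per input


# Assumed shapes: a 3x3 convolution with 64 -> 64 channels on a 224x224 map,
# and a fully-connected layer mapping 7*7*512 inputs to 4096 outputs.
conv_params, conv_macs = conv_cost(3, 3, 64, 64, 224, 224)
fc_params, fc_macs = fc_cost(7 * 7 * 512, 4096)
```

Under these assumptions the convolution holds far fewer parameters than the fully-connected layer but needs many more multiply-adds, which matches the split between acceleration and compression targets described above.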
Network Pruning
Pruning methods were proposed before deep learning became popular, and they have been widely studied in recent years (LeCun et al., 1989; Hassibi and Stork, 1993; Han et al., 2015a,b). Based on the assumption that many parameters in deep networks are unimportant or unnecessary, pruning methods remove these unimportant parameters. In this way, pruning methods significantly increase the sparsity of the parameters. The high sparsity of parameters after pruning brings two benefits for deep neural networks. On the one hand, the sparse parameters require less disk storage, since they can be stored in the Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) format. On the other hand, computations involving the pruned parameters are omitted, so the computational complexity of deep networks can be reduced. According to the granularity of pruning, pruning methods can be categorized into five groups: fine-grained pruning, vector-level pruning, kernel-level pruning, group-level pruning and filter-level pruning. Fig. 1 shows pruning methods with different granularities. In the following subsections, we give details of these different pruning methods.
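As a rough illustration of both benefits, the sketch below stores a pruned weight matrix in CSR form and multiplies it with a vector while touching only the surviving weights. The helper names `to_csr` and `csr_matvec` and the example matrix are made up for this illustration.

```python
def to_csr(dense):
    """Convert a dense 2-D list into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row inside `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector, skipping pruned (zero) weights."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical pruned weight matrix: 3 of 12 entries survive.
pruned = [[0.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [1.5, 0.0, 0.0, -3.0]]
values, col_idx, row_ptr = to_csr(pruned)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
```

Only the three non-zero values, their column indices and four row pointers are stored, and the product performs three multiply-adds instead of twelve.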
Fine-grained Pruning
Fine-grained pruning methods, or vanilla pruning methods, remove parameters in an unstructured way, i.e., any unimportant parameter in the convolutional kernels can be pruned, as shown in Fig. 1. Since there are no extra constraints on the pruning patterns, parameters can be pruned to a high sparsity. Early works on pruning (LeCun et al., 1989; Hassibi and Stork, 1993) used approximate second-order derivatives of the loss function w.r.t. the parameters to determine the saliency of parameters, and then pruned the parameters with low saliency. However, computing second-order derivatives is unaffordable for deep networks due to the huge computational complexity. Recently, Han et al. (2015a) proposed the deep compression framework, which compresses deep neural networks in three steps: pruning, quantization and Huffman encoding. Using this method, AlexNet can be compressed by 35× without a drop in accuracy. In Han et al. (2015a), however, parameters that have been pruned are never restored, so incorrectly pruned parameters may cause accuracy drops. To solve this problem, Guo et al. (2016) proposed the dynamic network surgery framework, which consists of two operations: pruning and splicing. The pruning operation removes unimportant parameters, while the splicing operation recovers incorrectly pruned connections. Their method requires fewer training epochs and achieves a better compression ratio than Han et al. (2015a).
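A minimal sketch of pruning with a mask, followed by a splicing-style recovery, is given below. Using the weight magnitude as the importance measure and a fixed threshold are assumptions of this sketch, and the weight update is hypothetical; it is not the exact procedure of either cited framework.

```python
def magnitude_mask(weights, threshold):
    """Return a 0/1 mask keeping weights whose magnitude reaches the threshold.
    Using |w| as the saliency measure is an assumption of this sketch."""
    return [1 if abs(w) >= threshold else 0 for w in weights]


def apply_mask(weights, mask):
    """Zero out pruned weights; the dense copy is left untouched."""
    return [w * m for w, m in zip(weights, mask)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6]
mask = magnitude_mask(weights, threshold=0.1)   # pruning step
sparse = apply_mask(weights, mask)

# Splicing-style recovery: the dense weights keep being updated, so a pruned
# connection can come back when the mask is recomputed.
weights[1] = -0.4                                # hypothetical update during retraining
mask_after = magnitude_mask(weights, threshold=0.1)
```

Keeping the dense weights next to the mask is what allows a connection to be restored; a scheme that discards pruned weights permanently cannot undo an incorrect pruning decision.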
Vector-level and Kernel-level Pruning
Vector-level pruning methods prune vectors in the convolutional kernels, and kernel-level pruning methods prune 2-D convolutional kernels in the filters. Since most pruning methods focus on fine-grained pruning and filter-level pruning, there are few works on vector-level and kernel-level pruning. Anwar et al. (2017) first explored kernel-level pruning, and they also proposed an intra-kernel strided pruning method, which prunes a sub-vector with a fixed stride. Mao et al. (2017) explored different granularities of pruning and found that vector-level pruning takes less storage than fine-grained pruning, because vector-level pruning requires fewer indices to indicate the pruned parameters. Besides, vector-level, kernel-level and filter-level pruning are more friendly to hardware implementations, since they are more structured pruning methods.
Group-level Pruning
Group-level pruning methods prune the parameters according to the same sparsity pattern on all filters. As shown in Fig. 2, each filter has the same sparsity pattern, so the convolutional filters can be represented as a thinned dense matrix. With group-level pruning, convolutions can be implemented by thinned dense matrix multiplication. As a result, the Basic Linear Algebra Subprograms (BLAS) can be utilized to achieve a higher speed-up. Lebedev and Lempitsky (2016) proposed Group-wise Brain Damage, which prunes the weight matrix in a group-wise fashion. By using group-sparsity regularizers, deep networks can be trained easily with group-sparsified parameters. Since group-level pruning can utilize the BLAS library, the practical speed-up is almost linear in the sparsity level. Using this method, they achieved 3.2× speed-up for all convolutional layers of AlexNet. Concurrently with Lebedev and Lempitsky (2016), Wen et al. (2016) proposed to use the group Lasso to prune groups of parameters. In contrast, Wen et al. (2016) explored different levels of structured sparsity in terms of filters, channels, filter shapes and depth. Their method can be regarded as a more general group-regularized pruning method. For AlexNet's convolutional layers, their method achieves about 5.1× and 3.1× speed-ups on CPU and GPU respectively.
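The thinned dense multiplication can be sketched with NumPy as follows. The sketch assumes the convolution has already been lowered to a matrix product (filters as rows, im2col-style patches as columns); the sizes and the shared sparsity pattern `keep` are chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, positions = 4, 18, 10          # k = w*h*c entries per filter (e.g. 3*3*2)
filters = rng.standard_normal((n, k))
patches = rng.standard_normal((k, positions))   # im2col-style patch matrix

# Group-level pruning: every filter shares one sparsity pattern, i.e. whole
# columns of the filter matrix are zero. The pattern here is arbitrary.
keep = np.array([0, 2, 3, 7, 11, 16])
pruned = np.zeros_like(filters)
pruned[:, keep] = filters[:, keep]

# Thinned dense multiplication: drop the pruned columns and the matching
# patch rows, then call an ordinary dense (BLAS-backed) matrix product.
thin_out = pruned[:, keep] @ patches[keep, :]
full_out = pruned @ patches
```

Because the surviving entries form a smaller dense matrix, no sparse indexing is needed at run time, which is why the speed-up can track the sparsity level closely.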
Filter-level Pruning
Filter-level pruning methods prune the convolutional filters or channels, which makes the deep networks thinner. After the filters of one layer are pruned, the corresponding channels of the next layer are also pruned. Thus, filter-level pruning is more efficient for accelerating deep networks. Luo et al. (2017) proposed a filter-level pruning method named ThiNet. They used the next layer's feature map to guide the filter pruning in the current layer. By minimizing the feature map's reconstruction error, they selected the channels in a greedy way. Similar to Luo et al. (2017), He et al. (2017) proposed an iterative two-step algorithm to prune filters by minimizing the feature map errors. Specifically, they introduced a selection weight β_i for each filter W_i and then added sparsity constraints on β_i, so that the channel selection problem can be cast as a LASSO regression problem. To minimize the feature map errors, they iteratively updated β and W. Their method achieved 5× speed-up on the VGG-16 network with little accuracy drop. Instead of using an additional selection weight β, Liu et al. (2017) proposed to leverage the scaling factors of batch normalization layers to evaluate the importance of filters. By pruning channels with near-zero scaling factors, they can prune filters without introducing overhead to the networks.
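The following sketch illustrates how pruning filters in one layer also removes input channels of the next layer, using batch normalization scaling factors as the importance measure. The weight layout (output channels, input channels, height, width), the `gamma` values and the threshold are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Weights stored as (out_channels, in_channels, height, width); layout assumed.
conv1 = rng.standard_normal((6, 3, 3, 3))
conv2 = rng.standard_normal((8, 6, 3, 3))
# Hypothetical batch normalization scaling factors following conv1.
gamma = np.array([0.9, 1e-4, 0.5, 2e-5, 0.7, 0.3])

keep = np.flatnonzero(np.abs(gamma) > 1e-2)  # threshold chosen for illustration
conv1_pruned = conv1[keep]        # remove filters of the current layer
gamma_pruned = gamma[keep]
conv2_pruned = conv2[:, keep]     # remove the matching input channels of the next layer
```

Both layers end up as smaller dense tensors, so the pruned network runs on standard dense kernels without any sparse format.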
Low-rank Approximation
The convolutional kernel of a convolutional layer, W ∈ R^{w×h×c×n}, is a 4-D tensor. These four dimensions correspond to the kernel width, the kernel height, and the numbers of input and output channels respectively. Note that by merging some of the dimensions, the 4-D tensor can be transformed into a t-D (t = 1, ..., 4) tensor. The motivation behind low-rank decomposition is to find an approximate tensor Ŵ that is close to W but facilitates more efficient computation. Many low-rank based methods have been proposed by the community; the two key differences between them are how the four dimensions are rearranged and on which dimension the low-rank constraint is imposed. Here we roughly divide the low-rank based methods into three categories, according to how many components the filters are decomposed into: two-component decomposition, three-component decomposition and four-component decomposition.
Two-component Decomposition
For two-component decomposition, the weight tensor is divided into two parts and the convolutional layer is replaced by two successive layers. Jaderberg et al. (2014) decomposed the w × h spatial filters into w × 1 and 1 × h filters. They achieved 4.5× speedup for a CNN trained on a text character recognition dataset, with a 1% accuracy drop.
SVD is a popular low-rank matrix decomposition method. By merging the dimensions w, h and c, the kernel becomes a 2-D matrix of size (w * h * c) × n, on which SVD can be conducted. In Denil et al. (2013), the authors utilized SVD to reduce the network redundancy. SVD was also investigated in Zhang et al. (2015), in which the filters were replaced by two filter banks: one consists of d filters of shape w × h × c and the other is composed of n filters of shape 1 × 1 × d. Here d represents the rank of the decomposition, i.e., the n filters are linear combinations of the first d filters. They also proposed a non-linear response reconstruction method based on the low-rank decomposition. On the challenging VGG-16 model for the ImageNet classification task, this two-component SVD decomposition method achieved 3× theoretical speedup at the cost of about a 1.66% increase in top-5 error.
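The two-component SVD decomposition can be sketched with NumPy as below. The tensor sizes and the rank d are arbitrary choices for illustration, and the kernel is random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
w, h, c, n, d = 3, 3, 16, 32, 8
W = rng.standard_normal((w, h, c, n))

# Merge w, h and c so the kernel becomes a (w*h*c) x n matrix.
M = W.reshape(w * h * c, n)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the top-d singular values: d filters of shape w x h x c ...
first = (U[:, :d] * s[:d]).reshape(w, h, c, d)
# ... followed by n filters of shape 1 x 1 x d.
second = Vt[:d, :]                       # shape (d, n)

approx = first.reshape(w * h * c, d) @ second
rel_err = np.linalg.norm(M - approx) / np.linalg.norm(M)

params_before = w * h * c * n
params_after = w * h * c * d + d * n
```

A random kernel is not low-rank, so `rel_err` is large in this toy setting; the decomposition is only useful when the trained filters are redundant enough for a small d to approximate them well.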
Similarly, another SVD decomposition can be obtained by exploiting the low-rank property along the input channel dimension c. In this case, we reshape the weight tensor into a matrix of size c × (w * h * n). By setting the rank to d, the convolution can be decomposed into a 1 × 1 × c × d convolution followed by a w × h × d × n convolution. These two decompositions are symmetric.
Three-component Decomposition
Based on the analysis of the two-component decomposition methods, one straightforward three-component decomposition method can be obtained by applying two successive two-component decompositions. Note that the SVD decomposition introduces two weight tensors. The first is a w × h × c × d tensor and the other is a d × n tensor (matrix). The first convolution is still very time-consuming due to the large size of the first tensor. We can therefore conduct another two-component decomposition on the first weight tensor after the SVD decomposition, which yields a three-component decomposition method. This strategy was studied in Zhang et al. (2015), in which, after the SVD decomposition, they applied the decomposition method proposed by Jaderberg et al. (2014) to the first decomposed tensor. Thus the final three components were convolutions with spatial sizes of w × 1, 1 × h and 1 × 1 respectively. With this three-component decomposition, they incurred only a 0.3% increase in top-5 error for 4× theoretical speedup.
If we apply the SVD decomposition along the input channel dimension to the first tensor after the two-component decomposition, we obtain the Tucker decomposition format proposed by Kim et al. (2015). The three components are a convolution of spatial size 1 × 1, a w × h convolution and another 1 × 1 convolution. Note that instead of using the two-step SVD decomposition, Kim et al. (2015) utilized the Tucker decomposition method directly to obtain these three components. Their method achieved 4.93× theoretical speedup at the cost of a 0.5% drop in top-5 accuracy.
To further lower the complexity, Wang and Cheng (2016) proposed a Block-Term Decomposition (BTD) method based on low-rank and group sparse decomposition. Note that in the Tucker decomposition, the second component, corresponding to the w × h convolution, also requires a large number of computations. Because the second tensor is already low-rank along both the input and output channel dimensions, the decomposition methods discussed above cannot be applied to it any further. Wang and Cheng (2016) proposed to approximate the original weight tensor by the sum of several smaller subtensors, each of which is in the Tucker decomposition format. By rearranging these subtensors, the BTD can be seen as a Tucker decomposition in which the second decomposed tensor is a block diagonal tensor. By using this decomposition, they achieved 7.4% actual speedup for the VGG-16 model, at the cost of a 1.3% increase in top-5 error.
Four-component Decomposition
By exploiting the low-rank property along the input/output channel dimensions as well as the spatial dimensions, a four-component decomposition can be obtained. This corresponds to the CP-decomposition acceleration method proposed in Lebedev et al. (2014). In this case, the four components are convolutions of size 1 × 1, w × 1, 1 × h and 1 × 1. The CP-decomposition can achieve a very high speedup ratio; however, due to the approximation error, only the second layer of AlexNet was processed in Lebedev et al. (2014). They achieved 4.5× speedup for the second layer of AlexNet at the cost of about a 1% accuracy drop.
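To compare the decomposition families, the sketch below counts multiply-adds per output position for the original layer and for the two-, three- (Tucker) and four-component (CP) forms described above. It assumes stride 1, unchanged output resolution, and that the w × 1 and 1 × h CP components act on each of the d channels separately; the function names and the example sizes are hypothetical.

```python
def macs_original(w, h, c, n):
    return w * h * c * n


def macs_two_component(w, h, c, n, d):
    # w x h x c x d convolution followed by a 1 x 1 x d x n convolution
    return w * h * c * d + d * n


def macs_tucker(w, h, c, n, d_in, d_out):
    # 1 x 1 (c -> d_in), w x h (d_in -> d_out), 1 x 1 (d_out -> n)
    return c * d_in + w * h * d_in * d_out + d_out * n


def macs_cp(w, h, c, n, d):
    # 1 x 1 (c -> d), w x 1 and 1 x h applied per channel, 1 x 1 (d -> n)
    return c * d + w * d + h * d + d * n


# Assumed example: 3x3 kernel, 256 input and output channels, all ranks 64.
shape = (3, 3, 256, 256)
costs = {
    "original": macs_original(*shape),
    "two-component": macs_two_component(*shape, 64),
    "tucker": macs_tucker(*shape, 64, 64),
    "cp": macs_cp(*shape, 64),
}
```

The counts shrink with each additional factor, but these are theoretical costs only; the approximation error and the achievable accuracy depend on how low-rank the trained kernel really is.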
Network Quantization
Quantization is an approach used in many compression and acceleration applications. It has wide applications in image compression, information retrieval, etc. Many quantization methods have also been investigated for network acceleration and compression. We mainly categorize these methods into two groups: scalar and vector quantization, which may need a codebook for quantization, and fixed-point quantization.
Scalar and Vector Quantization
Scalar and vector quantization techniques have a long history, and they were originally used for data compression. With scalar or vector quantization, the original data can be represented by a codebook and a set of quantization codes. The codebook contains a set of quantization centers, and the quantization codes indicate which quantization center each data point is assigned to. In general, the number of quantization centers is far smaller than the number of original data points. Besides, quantization codes can be encoded by a lossless encoding method, e.g. Huffman coding, or just represented as low-bit fixed-point numbers. Thus scalar or vector quantization can achieve a high compression ratio. Gong et al. (2014) explored scalar and vector quantization techniques for compressing deep networks. For scalar quantization, they used the well-known K-means algorithm to compress the parameters. In addition, the product quantization (PQ) algorithm (Jegou et al., 2011), a special case of vector quantization, was leveraged to compress the fully-connected layers. By partitioning the feature space into several disjoint subspaces and then conducting K-means in each subspace, the PQ algorithm can compress the fully-connected layers with little loss. As Gong et al. (2014) only compressed the fully-connected layers, Wu et al. (2016) proposed to utilize the PQ algorithm to simultaneously accelerate and compress convolutional neural networks. They proposed to quantize the convolutional filters layer by layer by minimizing the feature map's reconstruction loss. During the inference phase, a look-up table is built by pre-computing the inner products between feature map patches and codebooks, and the output feature map can then be calculated by simply accessing the look-up table. Using this method, they can achieve 4 ∼ 6× speed-up and 15 ∼ 20× compression ratio with little accuracy drop.
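A minimal sketch of scalar quantization with a codebook is shown below, using plain K-means (Lloyd) iterations on a handful of made-up weights. The initial centers, the number of iterations and the data are assumptions for illustration; product quantization would run the same procedure separately in each subspace.

```python
def kmeans_1d(data, centers, iters=20):
    """Plain Lloyd iterations on scalars; returns (codebook, codes)."""
    centers = list(centers)
    codes = [0] * len(data)
    for _ in range(iters):
        # Assignment step: index of the nearest quantization center.
        codes = [min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
                 for x in data]
        # Update step: each center becomes the mean of its assigned values.
        for k in range(len(centers)):
            members = [x for x, c in zip(data, codes) if c == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, codes


weights = [-1.02, -0.98, -0.05, 0.0, 0.04, 0.97, 1.0, 1.03]
codebook, codes = kmeans_1d(weights, centers=[-1.0, 0.0, 1.0])
dequantized = [codebook[c] for c in codes]
```

With three centers each code fits in two bits, so the eight weights are stored as eight 2-bit codes plus a three-entry codebook instead of eight full-precision values.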
Fixed-point Quantization
Fixed-point quantization is an effective approach for lowering the resource consumption of networks. Based on which part is quantized, the methods fall into two main categories, i.e., weight quantization and activation quantization. Some other works also try to quantize gradients, which can accelerate the network training stage. We mainly review weight quantization and activation quantization methods, which accelerate the test-phase computation. Table 2 summarizes these methods according to which part is quantized and whether the training and testing stages can be accelerated.
Fixed-point Quantization of Weights
Fixed-point weight quantization is a long-standing topic in network acceleration and compression. Hammerstrom (2012) proposed a VLSI architecture for network acceleration using 8-bit inputs and outputs and a 16-bit internal representation. Later, in Holi and Hwang (1993), the authors provided a theoretical analysis of the error caused by low-bit quantization to determine the bit-width for a multilayer perceptron. They showed that 8-16 bit quantization was sufficient for training small neural networks. These early works mainly focused on simple multilayer perceptrons. A more recent work (Chen et al., 2014) showed that it is necessary to use 32-bit fixed-point numbers for the convergence of a convolutional neural network trained on the MNIST dataset. By using stochastic rounding, Gupta et al. (2015) found that it is sufficient to use 16-bit fixed-point numbers to train a convolutional neural network on MNIST. 8-bit fixed-point quantization was also investigated in Dettmers (2015) to speed up the convergence of deep networks in parallel training. Logarithmic data representation was also investigated in Miyashita et al. (2016).
Recently, much lower-bit quantization, and even binary and ternary quantization methods, have been investigated. Expectation Backpropagation (EBP) was introduced in Cheng et al. (2015), which utilized a variational Bayes method to binarize the network. The BinaryConnect method proposed in Courbariaux et al. (2015) constrained all weights to be either +1 or -1. When trained from scratch, BinaryConnect can even outperform its floating-point counterpart on the CIFAR-10 (Krizhevsky and Hinton, 2009) image classification dataset. Using binary quantization, the network can be compressed by about 32 times compared to 32-bit floating-point networks. Most of the floating-point multiplications can also be eliminated (Lin et al., 2015). In Rastegari et al. (2016), the authors proposed the Binary Weight Network (BWN), which was among the earliest works that achieved good results on the large ImageNet dataset (Russakovsky et al., 2015). Loss-aware binarization was proposed in Hou et al. (2016), which can directly ... Zhou et al. (2017) gradually turned all weights into logarithmic format in a multi-step manner. This incremental quantization strategy can lower the quantization error during each stage and thus makes the quantization problem much easier. All the low-bit quantization methods discussed above directly quantize the full-precision weights into fixed-point format. In Wang and Cheng (2017), the authors proposed a very different quantization strategy. Instead of direct quantization, they proposed to use a Fixed-point Factorization Network (FFN) to quantize all weights into ternary values. This fixed-point decomposition method can significantly lower the quantization error. The FFN method achieved comparable results on commonly used deep models such as AlexNet, VGG-16 and ResNet.
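As a small illustration of where the roughly 32× compression comes from, the sketch below binarizes a few made-up weights and packs the signs into a single integer word, one bit per weight. Approximating the weights by a scaling factor times their signs, with the mean magnitude as the scaling factor, follows the Binary Weight Network idea; BinaryConnect uses plain +1/-1 values without scaling. The helper names are hypothetical.

```python
def binarize(weights):
    """Approximate weights by alpha * sign(w), with alpha the mean magnitude."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack_bits(signs):
    """Pack +1/-1 values into an integer, one bit per weight (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word


weights = [0.5, -0.25, 0.75, -1.0]
alpha, signs = binarize(weights)
word = pack_bits(signs)
approx = [alpha * s for s in signs]
```

Each weight costs one bit plus a shared scaling factor, and multiplying an activation by a binary weight reduces to adding or subtracting it.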
Fixed-point Quantization of Activations
With only weight quantization, time-consuming floating-point operations are still needed. If the activations are also quantized into fixed-point values, the network can be executed efficiently using only fixed-point operations. Many activation quantization methods have also been proposed by the deep learning community. The bitwise neural network was proposed in Kim and Smaragdis (2016). Binarized Neural Networks (BNN) were among the first works that quantized both weights and activations into either -1 or +1. BNN achieved accuracy comparable to the full-precision baseline on the CIFAR-10 dataset. To extend BNN to the ImageNet classification task, the authors of Tang et al. (2017) improved the training strategies of BNN. Much higher accuracy was reported using these strategies. Based on the BWN, the authors of Rastegari et al. (2016) further quantized all activations into binary values, turning the network into XNOR-net. Compared with BNN, XNOR-net can achieve much higher accuracy on the ImageNet dataset. To further understand the effect of bit-width on the training of deep neural networks, DoReFa-net was proposed in Zhou et al. (2016). It investigated the effect of different bit-widths for weights, activations and gradients. By making use of batch normalization, Cai et al. (2017) presented the Half-wave Gaussian Quantization (HWGQ) method to quantize both weights and activations. High performance was achieved on commonly used CNN models using the HWGQ method, with 2-bit activations and binary weights.
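When both weights and activations are binary, an inner product can be computed with bitwise operations alone. The sketch below packs +1/-1 vectors into integers and evaluates the dot product with XNOR followed by a population count; the packing convention (+1 stored as bit 1) and the helper names are assumptions of this sketch.

```python
def pack(signs):
    """Pack +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors stored as bit words."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")   # XNOR, then popcount
    return 2 * agree - n    # agreements contribute +1, disagreements -1


a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
result = binary_dot(pack(a), pack(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

On hardware, the XNOR and popcount cover a whole machine word at a time, which replaces that many multiply-adds with a couple of logic instructions.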
Teacher-student Network
The teacher-student approach differs from the network compression and acceleration methods above, since it trains a student network with the help of a teacher network, and the student network can be designed with a totally different network architecture. Generally speaking, a teacher network is a large neural network or an ensemble of neural networks, while a student network is a compact and efficient neural network. By utilizing the dark knowledge transferred from the teacher network, the student network can achieve higher accuracy than when it is trained merely with the class labels. Hinton et al. (2015) proposed the knowledge distillation (KD) method, which trains a student network using the output of the teacher network's softmax layer. Following this line, Romero et al. (2014) proposed FitNets to train a deeper and thinner student network. Since the depth of a neural network is more important than its width, a deeper student network would have higher accuracy. Besides, they utilized both the intermediate layers' feature maps and the soft outputs of the teacher network to train the student network. Rather than mimicking the intermediate layers' feature maps, Zagoruyko and Komodakis (2016) proposed to train a student network by imitating the attention maps of a teacher network. Their experiments showed that the attention maps are more important than the layers' activations, and their method can achieve higher accuracy than FitNets.
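Training on the teacher's softmax outputs can be sketched as a loss that mixes a soft-target term with the usual hard-label term. The temperature and the mixing weight `alpha` below are hypothetical hyperparameters, the logits are made up, and the commonly used rescaling of the soft term by the squared temperature is omitted for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of soft-target cross-entropy and hard-label cross-entropy."""
    soft_targets = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_ce = -sum(p * math.log(q) for p, q in zip(soft_targets, soft_student))
    hard_ce = -math.log(softmax(student_logits)[label])
    return alpha * soft_ce + (1.0 - alpha) * hard_ce


teacher = [5.0, 1.0, 0.0]
good_student = [4.0, 1.0, 0.5]
bad_student = [0.0, 1.0, 4.0]
loss_good = distillation_loss(good_student, teacher, label=0)
loss_bad = distillation_loss(bad_student, teacher, label=0)
```

A higher temperature flattens the teacher distribution, so the relative scores of the wrong classes, the so-called dark knowledge, contribute more to the training signal.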
Compact Network Design
The objective of the network acceleration and compression methods discussed above is to optimize the execution and storage of a given deep neural network. One property of these methods is that the network architecture is not changed. A parallel branch of network acceleration and compression is to design network architectures that are themselves more efficient and lower in cost.
In Lin et al. (2013), the authors proposed the Network-In-Network architecture, in which 1 × 1 convolutions were utilized to increase the network capacity while keeping the overall computational complexity small. To reduce the storage requirement of CNN models, they also proposed to remove the fully-connected layers and make use of global average pooling. These strategies are also used by many state-of-the-art CNN models such as GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016).
Branching (multiple group convolution) is another commonly used strategy for lowering network complexity, which was explored in GoogLeNet (Szegedy et al., 2015). By making extensive use of 1 × 1 convolutions and the branching strategy, SqueezeNet, proposed in Iandola et al. (2016), is about 50× smaller than AlexNet with comparable accuracy. By branching, ResNeXt (Xie et al., 2017) can achieve much higher accuracy than ResNet (He et al., 2016) at the same computational budget. The depthwise convolution proposed in MobileNet (Howard et al., 2017) takes the branching strategy to the extreme, i.e., the number of branches equals the number of input/output channels. The resulting MobileNet can be 32× smaller and 27× faster than the VGG-16 model, with comparable image classification accuracy on ImageNet. When depthwise convolutions and 1 × 1 convolutions are used as in MobileNet, most of the computation and parameters reside in the 1 × 1 convolutions. One strategy to further lower the complexity of the 1 × 1 convolutions is to use multiple groups. ShuffleNet, proposed in Zhang et al. (2017), introduced the channel shuffle operation to increase the information exchange among the multiple groups, which can markedly increase the representation power of the networks. Their method achieved about 13× actual speedup over AlexNet with comparable accuracy.
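Two of the ideas above can be sketched in a few lines: the cost of a depthwise convolution followed by a 1 × 1 convolution compared with a standard convolution, and a channel shuffle expressed as an index permutation. The counts are multiply-adds per output position without biases, and the kernel size and channel numbers are assumed for illustration.

```python
def standard_conv_macs(k, c_in, c_out):
    return k * k * c_in * c_out


def depthwise_separable_macs(k, c_in, c_out):
    # k x k depthwise convolution (one filter per channel) + 1 x 1 pointwise convolution
    return k * k * c_in + c_in * c_out


def channel_shuffle(channels, groups):
    """Reorder channels so each output group mixes channels from every input group."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


standard = standard_conv_macs(3, 256, 256)
separable = depthwise_separable_macs(3, 256, 256)
pointwise_share = (256 * 256) / separable     # fraction spent in the 1 x 1 convolution
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
```

In this setting the 1 × 1 convolution accounts for well over 90% of the separable cost, which is why grouping it, and then shuffling channels so that the groups still exchange information, is an attractive next step.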
Hardware Accelerator
Background
Deep neural networks provide impressive performance on various tasks while suffering from huge computational complexity. Traditionally, algorithms based on deep neural networks are executed on general-purpose platforms such as CPUs and GPUs, but this comes at the cost of high power consumption and oversized resource utilization for both computing and storage. In recent years, there has been an increasing number of applications based on embedded systems, including autonomous vehicles, unmanned drones and security cameras. Considering the demands for high performance, light weight and low energy cost on these devices, CPU/GPU-based solutions are no longer suitable. In this scenario, FPGA/ASIC-based hardware accelerators are gaining popularity as efficient alternatives.
General Architecture
The deployment of a DNN in a real-world application consists of two phases: training and inference. Network training is known to be expensive in terms of speed and memory, and is therefore usually carried out off-line on GPUs. During inference, the pre-trained network parameters can be loaded either from the cloud or from dedicated off-chip memory. Most recently, hardware accelerators for training have received attention (Ko et al., 2017; Yang, 2017; Venkataramani et al., 2017), but in this section we mainly focus on the inference phase in embedded settings.
Typically, an accelerator is composed of five parts: data buffers, parameter buffers, processing elements, a global controller and an off-chip transfer manager, as shown in Fig. 3. The data buffers are used to cache input images, intermediate data and output predictions, while the parameter buffers are mainly used to cache convolutional filters. Processing elements are a collection of basic computing units that execute multiply-adds, non-linearities and other functions such as normalization and quantization. The global controller orchestrates the computing flow on chip, while off-chip transfers of data and instructions are conducted through the transfer manager. This basic architecture can be found in existing accelerators designed for both specific and general tasks.
Heterogeneous computing is widely adopted in hardware acceleration. Computing-intensive operations such as multiply-adds are efficiently mapped onto dedicated hardware for high throughput, whereas data pre-processing, softmax and other graphic operations can be placed on the CPU/GPU for low-latency processing.
Processing Elements
Among all of the accelerators, the biggest differences exist in the processing elements, as they are designed for the majority of the computing tasks in deep networks, such as massive multiply-add operations, normalization (batch normalization or local response normalization) and non-linearities (ReLU, sigmoid and tanh). Typically, the computing engine of an accelerator is composed of many small basic processing elements, as shown in Fig. 3. This architecture is mainly designed to fully exploit data reuse and parallelism. However, many accelerators use only one processing element in order to reduce data movement and save resources (Zhang, 2015; Ma et al., 2017c).
Optimizing for High Throughput
Since the majority of computations in a network are matrix-matrix or matrix-vector multiplications, it is critical to handle the massive nested loops efficiently to achieve high throughput. Loop optimization, including loop tiling, loop unrolling and loop interchange, is one of the most frequently adopted techniques in accelerator design (Zhang, 2015; Ma et al., 2017b; Suda, 2016; Alwani et al., 2016; Xiao, 2017; Li et al., 2018). Loop tiling divides the entire data into multiple small blocks in order to alleviate the pressure on on-chip storage (Ma et al., 2017b; Alwani et al., 2016; Qiu, 2016), while loop unrolling improves the parallelism of the computing engine for high speed (Ma et al., 2017b; Qiu, 2016). Loop interchange determines the sequential computation order of the nested loops, since different orders can result in significantly different performance. The well-known systolic array can be seen as a combination of the loop optimization methods listed above, which leverages data locality and weight sharing in the network to achieve high throughput (Jouppi, 2017; Wei, 2017).
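To make the three loop transformations concrete, the following is a minimal software sketch on a plain matrix multiplication. It is an illustration only, not the scheme of any cited accelerator: the function names and the tile size are assumptions introduced here, and in hardware the innermost loops would be unrolled across parallel multiply-add units rather than executed sequentially.

```python
def matmul_naive(A, B):
    """Reference triple loop: C[i][j] += A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same computation with tiled and interchanged loops.

    Outer loops walk over blocks (the data that would be fetched into
    on-chip buffers); inner loops stay inside one block (the part a
    hardware design would unroll for parallelism).
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):      # interchange: k-block before j-block
            for j0 in range(0, p, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        a = A[i][k]       # loaded once, reused over the j loop
                        for j in range(j0, min(j0 + tile, p)):
                            C[i][j] += a * B[k][j]
    return C
```

Both functions perform the same multiply-adds; only the order of execution and the working set touched at any moment differ, which is what determines buffer size and data reuse on chip.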
SIMD-based computation is another way to achieve high throughput. Nguyen et al. (2017) presented a method for packing two low-bit multiplications into a single DSP block to double the computation. Price et al. (2017) also proposed a SIMD-based architecture for speech recognition.
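The idea of packing two low-bit multiplications into one wide multiplier can be sketched as follows. This is a simplified illustration that assumes unsigned operands sharing one common factor; it is not the exact scheme of Nguyen et al. (2017), and the function name and the default bit-width are assumptions made here.

```python
def packed_dual_multiply(a, b, c, bits=8):
    """Compute a*c and b*c with a single wide multiplication.

    Simplified illustration: operands must be unsigned `bits`-bit values.
    """
    limit = 1 << bits
    if not (0 <= a < limit and 0 <= b < limit and 0 <= c < limit):
        raise ValueError("operands must be unsigned and fit in `bits` bits")
    shift = 2 * bits                   # each product needs at most 2*bits bits
    packed = (a << shift) | b          # a in the high field, b in the low field
    wide = packed * c                  # the single multiplication
    low = wide & ((1 << shift) - 1)    # low field holds b * c
    high = wide >> shift               # high field holds a * c
    return high, low
```

Because b * c always fits in 2*bits bits, the two partial products never overlap and can be separated by a mask and a shift. Signed operands would need additional correction logic, which is omitted here.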
Optimizing for Low Energy Consumption
Existing work attempts to reduce the energy consumption of a hardware accelerator from both the computing and the IO perspectives. Horowitz (2014) systematically illustrated the energy cost of arithmetic operations and memory accesses, and demonstrated that integer operations are much cheaper than their floating-point counterparts, with lower bit-widths costing less. Therefore, most existing accelerators adopt low-bit or even binary data representations (Zhao, 2017; Umuroglu, 2017; Nurvitadhi, 2017) to improve energy efficiency. Most recently, logarithmic computation, which turns multiplications into bit-shift operations, has also shown promise in energy saving (Edward, 2017; Gudovskiy and Rigazio, 2017; Tann et al., 2017).
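As a rough illustration of how logarithmic computation replaces a multiplier with a shifter, the sketch below rounds a weight to the nearest power of two in the log domain and then applies it to an integer activation using shifts only. The helper names are hypothetical, activations are assumed to be non-negative integers, and the cited works use their own quantization rules.

```python
import math


def quantize_to_power_of_two(w):
    """Return (sign, exponent) so that sign * 2**exponent approximates w."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exponent = round(math.log2(abs(w)))   # nearest exponent in the log domain
    return sign, exponent


def shift_multiply(x, sign, exponent):
    """Multiply a non-negative integer x by sign * 2**exponent using shifts."""
    if sign == 0:
        return 0
    # Left shift for exponent >= 0, right shift (truncating) otherwise.
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return sign * shifted
```

For example, a weight of 0.3 is approximated by 2^-2, so multiplying an activation by it becomes a right shift by two bits, at the cost of the quantization error between 0.3 and 0.25.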
Sparsity is gaining popularity in accelerator design, based on the observation that a large fraction of arithmetic operations can be skipped to save energy. Han et al. (2016), Han et al. (2017) and Parashar et al. (2017) designed architectures for image or speech recognition based on network pruning, while Albericio et al. (2016) and Zhang et al. (2016c) proposed to eliminate ineffectual operations by exploiting the inherent sparsity in networks.
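The saving from sparsity can be illustrated with a dot product that skips both pruned (zero) weights and zero activations. This is a minimal software sketch with hypothetical function names and a simple index-value storage format chosen for illustration; the cited architectures use their own compressed formats and dataflows.

```python
def compress(weights):
    """Keep only the nonzero weights as (index, value) pairs."""
    return [(i, w) for i, w in enumerate(weights) if w != 0]


def sparse_dot(compressed, activations):
    """Dot product that skips ineffectual operations.

    Returns (result, macs), where macs counts the multiply-adds
    that were actually performed.
    """
    total, macs = 0, 0
    for i, w in compressed:           # zero weights were never stored
        x = activations[i]
        if x == 0:                    # zero activation, e.g. after ReLU
            continue
        total += w * x
        macs += 1
    return total, macs
```

With weights [0, 2, 0, 0, -1, 3, 0, 0] and activations [5, 0, 1, 4, 2, 1, 0, 7], only two of the eight multiply-adds of the dense computation are executed.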
Off-chip data transfers are pervasive in hardware accelerators because both the network parameters and the intermediate data are too large to fit on chip. Horowitz (2014) suggested that the power consumed by a DRAM access is several orders of magnitude higher than that of an SRAM access, so reducing off-chip transfers is a critical issue. Shen et al. (2017) designed a flexible data buffering scheme to reduce bandwidth requirements, and Alwani et al. (2016) and Xiao (2017) likewise aimed to reduce off-chip transfers.
Many other approaches have been proposed to reduce power consumption. Zhang et al. (2016b) used a pipelined FPGA cluster to realize acceleration, Chen et al. (2017b) presented an energy-efficient row-stationary scheme to reduce data movement, and Zhu et al. (2016b) attempted to reduce power consumption via low-rank approximation.
Design Automation
Recently, design automation frameworks that automatically map deep neural networks onto hardware have gained increasing attention. Sharma et al. (2016), Venieris and Bouganis (2016) and Wei (2017) proposed frameworks that automatically generate a synthesizable accelerator for a given network. Ma et al. (2017a) presented an RTL compiler for FPGA implementation of diverse networks. Liu et al. (2016) proposed an instruction set for hardware implementation, while Zhang et al. (2016a) proposed a uniform convolutional matrix multiplication representation for CNNs.
Emerging Techniques
In the past few years, many new techniques from both the algorithm side and the circuit side have been adopted to implement fast and energy-efficient accelerators. Stochastic computing, which represents continuous values by streams of random bits, has been investigated for hardware acceleration of deep neural networks (Ren et al., 2017; Sim and Lee, 2017; Kim et al., 2016b). On the hardware side, RRAM-based accelerators (Chen et al., 2017a; Xia et al., 2016) and the use of 3D DRAM (Kim et al., 2016a; Gao et al., 2017) are receiving more attention.
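A small simulation may help to show why stochastic computing is attractive for hardware. Assuming the common unipolar encoding, in which a value p in [0, 1] is represented by a bit stream whose bits are 1 with probability p, the product of two independent streams is obtained with a bitwise AND, that is, one AND gate per bit. The function names, the stream length and the fixed seed below are assumptions of this sketch, not details from the cited works.

```python
import random


def to_bitstream(p, length, rng):
    """Encode p in [0, 1] as bits that are 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_bitstream(bits):
    """Decode a stream as the fraction of ones."""
    return sum(bits) / len(bits)


def stochastic_multiply(p, q, length=4096, seed=0):
    """Approximate p * q by AND-ing two independent random bit streams."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    a = to_bitstream(p, length, rng)
    b = to_bitstream(q, length, rng)
    return from_bitstream([x & y for x, y in zip(a, b)])
```

The result is only an estimate whose error shrinks as the stream grows longer, which reflects the usual trade-off in stochastic computing between very cheap arithmetic and longer computation time.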
The Trend and Discussion
In this section, we discuss some possible future directions on this topic.
Non-finetuning or Unsupervised Compression. Most existing methods, including network pruning, low-rank compression and quantization, need labeled data to retrain the network in order to retain accuracy. The problems are twofold. First, labeled data is sometimes unavailable, as with medical images. Second, retraining requires a lot of human effort as well as professional knowledge. These two problems raise the need for unsupervised compression or even finetuning-free compression methods.
Scalable (Self-adaptive) Compression. Current compression methods have many hyper-parameters that must be determined in advance, for example the sparsity of network pruning, the rank of decomposition-based methods or the bit-width of fixed-point quantization methods. Selecting these hyper-parameters is tedious and also requires professional experience. Thus, the investigation of methods that do not rely on human-designed hyper-parameters is a promising research topic. One direction may be to use annealing methods or reinforcement learning.
Network Acceleration for Object Detection. Most model acceleration methods are optimized for image classification, and very few efforts have been devoted to accelerating other computer vision tasks such as object detection. It may seem that model acceleration methods for image classification can be directly used for detection. However, deep neural networks for object detection or image segmentation are more sensitive to model acceleration methods, i.e., applying the same acceleration method to object detection causes a larger accuracy drop than for image classification. One reason for this phenomenon may be that object detection requires more complex feature representations than image classification. How to design model acceleration methods for object detection remains a challenge.
Hardware-software Co-design. To accelerate deep learning algorithms on dedicated hardware, a straightforward method is to pick a model and design a corresponding architecture. However, the gap between algorithm modeling and hardware implementation makes this difficult to put into practice. Recent advances in deep learning algorithms and hardware accelerators demonstrate that it is highly desirable to design hardware-efficient algorithms according to the low-level features of specific hardware platforms. This co-design methodology will be a trend in future work.
Conclusion
Deep neural networks provide impressive performance while suffering from huge computational complexity and energy cost. In this paper, we give a survey of recent advances in efficient processing of deep neural networks from both the algorithm and the hardware side. In addition, we point out a few topics that deserve attention in the future.
