Contemporary Deep Neural Networks (DNNs) contain millions of synaptic connections spread across tens to hundreds of layers. This large computational complexity poses a challenge to hardware design. In this work, we leverage the intrinsic activation sparsity of DNNs to substantially reduce execution cycles and energy consumption. An end-to-end training algorithm is proposed to develop a lightweight (less than 5% overhead) run-time predictor that estimates the output activation sparsity on the fly. Furthermore, an energy-efficient hardware architecture, SparseNN, is proposed to exploit both the input and output sparsity. SparseNN is a scalable architecture with distributed memories and processing elements connected through a dedicated on-chip network. Compared with state-of-the-art accelerators that exploit only the input sparsity, SparseNN achieves a 10%-70% improvement in throughput and a power reduction of around 50%.
I. INTRODUCTION
Deep Neural Networks (DNNs) are one of the fundamental machine learning models. In the past decade, DNNs have attracted great research interest due to their promising results in various domains, including visual recognition [1], natural language processing [2], and artificial intelligence [3]. Although DNNs can outperform many traditional machine learning models, their large computation and storage requirements are an obstacle to wide deployment in embedded applications. Therefore, given the limited resources of embedded platforms, both algorithmic and architectural optimizations are required to deliver an energy-efficient solution for DNNs.
In order to avoid overfitting and to be biologically plausible, the Rectified Linear Unit (ReLU) is extensively used in DNNs, which leads to a large number of zeros in the activations of hidden layers. It is reported that there is around 50% sparsity in contemporary DNN models [4]. The zero activations can be exploited to design an energy-efficient implementation, as the multiplications and memory accesses associated with them can be safely bypassed without affecting the computed result. The activation sparsity can be classified into two categories: the input activation sparsity and the output activation sparsity. The input activation sparsity refers to the zero activations within the input feature map, and it is already known when the computation starts. The output activation sparsity, on the other hand, refers to the zero activations in the output feature map and is unknown until the computation of the current layer is finished. In this work, we propose an efficient end-to-end training algorithm to form a run-time predictor that can predict the output activation sparsity before the actual computation of the current layer. The computation overhead of making the prediction is less than 5% of the original feedforward calculation. To exploit both types of sparsity efficiently and achieve high energy efficiency, a specialized hardware architecture, SparseNN, is proposed. SparseNN is a scalable Network-on-Chip (NoC) based architecture with distributed processing elements and memories. It can effectively take advantage of both input and output activation sparsity. The experimental results show that the throughput can be improved by 10% to 70% with a power reduction of 50% when these two types of sparsity are jointly utilized.
Traditional deep learning accelerators take advantage of either input activation sparsity [4], [5] or output activation sparsity [6]. The input activation sparsity can be easily exploited using a leading nonzero detector [7] because the input activation vector a^(l) is already known in the feedforward pass. On the other hand, the output activation sparsity can be predicted using a lightweight prediction phase as shown in Fig. 1, where the prediction phase is obtained from the truncated Singular Value Decomposition (SVD) of the weight matrix W^(l) [8]. Specifically, the feedforward pass of DNNs using the truncated SVD output predictor can be summarized as follows:
where p^(l) is the sparsity predictor of the output activations, and U^(l) and V^(l) are the first r left-singular vectors and right-singular vectors of W^(l), respectively. However, the truncated SVD scheme always seeks the solution that minimizes the approximation error in the Frobenius norm [8], which may not yield an optimal sparsity predictor. Moreover, U^(l) and V^(l) are only updated once per epoch during training [8]. This static updating rule limits the flexibility of the backpropagation.
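The two steps that the text refers to as Eq. (1) (the predictor) and Eq. (2) (the gated feedforward) can be sketched from the definitions above as follows. This is a reconstruction under assumptions rather than a verbatim restatement: the ReLU $\sigma$, the elementwise product $\odot$, the shapes $U^{(l)} \in \mathbb{R}^{m \times r}$ and $V^{(l)} \in \mathbb{R}^{r \times n}$, and the convention that a non-positive prediction gates the corresponding output to zero are all assumed here.

```latex
\begin{align}
p^{(l)}   &= \operatorname{sign}\!\left(U^{(l)} V^{(l)} a^{(l)}\right) \tag{1} \\
a^{(l+1)} &= \sigma\!\left(W^{(l)} a^{(l)}\right) \odot p^{(l)} \tag{2}
\end{align}
% Cost of the predictor relative to the full layer (multiply-accumulates):
%   predictor: r(m + n)      full layer: mn
% For a hypothetical layer with m = n = 1000 and r = 16:
%   16 * (1000 + 1000) / (1000 * 1000) = 32000 / 1000000 = 3.2\%
```

Under these assumptions, evaluating $V^{(l)} a^{(l)}$ first and then multiplying by $U^{(l)}$ keeps the predictor cost proportional to the rank $r$, which is consistent with the stated overhead of less than 5%.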
In summary, this work brings the following contributions: (a) A novel end-to-end training algorithm is proposed to generate the output sparsity predictor of the neural network. (b) A scalable NoC based architecture is developed to take advantage of both input and output activation sparsity.
II. SPARSITY PREDICTOR: END-TO-END TRAINING
To address the limitations of the truncated SVD approach, we propose a more powerful end-to-end training algorithm that searches for a better output sparsity predictor. The internal structure of the predictor remains the same as in [8], consisting of U^(l) and V^(l). However, the two matrices are derived from an end-to-end training phase rather than from SVD.
During training, we need to backpropagate the gradient of the loss not only into the original feedforward pass but also into the sparsity predictor pass. Most of the derivative calculation is straightforward, except for passing the derivative from the predictor to U and V. In Eq. (1), the derivative of the sign function blocks the output gradient from propagating back to U and V during backpropagation, since the derivative is zero for every input other than 0. Inspired by [9], we adopt a similar approach using the "straight-through estimator". The basic idea is to approximate sign(x) with the piecewise linear function max(−1, min(1, x)), whose derivative is 1 when the input is in [−1, 1]. The modified gradient calculation in the proposed end-to-end training algorithm is summarized in Alg. 1. To regularize the sparsity of the output activations, we add the ℓ1 norm of the sparsity predictor p^(l) to the original loss function, so that training optimizes both the error rate and the sparsity level, as shown in Alg. 1, Line 3.
Algorithm 1: Modified backpropagation algorithm.
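The straight-through estimator described above can be illustrated with a minimal scalar sketch. It is not the paper's Alg. 1: the function names `sign`, `ste_backward`, and `l1_penalty`, as well as the regularization weight `lam`, are hypothetical and chosen only for illustration.

```python
def sign(x):
    """Forward pass of the predictor nonlinearity."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


def ste_backward(x, upstream_grad):
    """Backward pass with the straight-through estimator.

    The derivative of sign(x) is replaced by the derivative of
    max(-1, min(1, x)): 1 inside [-1, 1] and 0 outside.
    """
    return upstream_grad if -1.0 <= x <= 1.0 else 0.0


def l1_penalty(p, lam):
    """L1 regularization term added to the loss (lam is a hypothetical weight)."""
    return lam * sum(abs(v) for v in p)


# Hypothetical predictor pre-activations for four output neurons.
pre_activations = [-2.0, -0.5, 0.3, 1.5]
forward = [sign(x) for x in pre_activations]
backward = [ste_backward(x, 1.0) for x in pre_activations]
```

In this sketch the gradient reaches U and V only for pre-activations inside [−1, 1]; values that are already far from the decision boundary receive no update from the predictor path.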
III. SPARSENN: A SCALABLE HARDWARE ARCHITECTURE
After the output sparsity predictor matrices U^(l) and V^(l) are obtained using the proposed end-to-end training algorithm, a specialized hardware architecture is required to accelerate the inference phase of DNNs with both input and output sparsity. A traditional Single Instruction Multiple Data (SIMD) microarchitecture such as [6], [10] is not a scalable solution because the required memory bandwidth increases linearly with the SIMD width. To exploit the input sparsity, a distributed hardware accelerator called EIE, targeted at accelerating DNNs with compressed weights, was proposed in [7]. In this work, we enhance the microarchitecture of EIE to exploit both the input and output sparsity.
A. Hierarchical Architecture of SparseNN
SparseNN is a scalable distributed hardware architecture consisting of 64 processing elements (PEs). As shown in Fig. 2, the 64 PEs are connected through a dedicated 3-level on-chip H-tree network, which has routers at the leaf level, the internal level, and the root level. The computation of the matrix-vector multiplication is distributed across the PEs. More specifically, every row W_(j,:) of the weight matrix and every input activation i_j with j mod 64 = k are stored in the k-th PE, which also computes the corresponding output activations o_j. Since each PE stores only a subset of the input activations, the output activations cannot be computed locally until all the input activations are received. As a result, an additional broadcasting stage is required to distribute the local input activations stored in each PE, along with their indices, to all PEs through the on-chip network. In order to exploit the input sparsity, only the nonzero activations in each PE are broadcast. Each PE starts the local multiplication and accumulation as soon as it receives nonzero input activations from the on-chip network. During inference, SparseNN first calculates the sparsity predictor (i.e., Eq. (1)) and then computes the original matrix-vector multiplication in Eq. (2). Since the dimensions of the matrices U, V, and W are very different, they require different computation schedules, which are discussed in Section III.C.
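The mapping rule j mod 64 = k and the nonzero-only broadcast can be sketched as below. This is an illustrative software model, not the hardware implementation; the function names `owner_pe` and `nonzero_broadcasts` are hypothetical.

```python
NUM_PES = 64  # number of PEs stated in the text


def owner_pe(j, num_pes=NUM_PES):
    """Index k of the PE that stores input activation j and row j of W."""
    return j % num_pes


def nonzero_broadcasts(activations, num_pes=NUM_PES):
    """Group the (index, value) pairs each PE would broadcast.

    Zero activations are skipped, which is how the input sparsity is
    exploited in this model.
    """
    per_pe = {k: [] for k in range(num_pes)}
    for j, a_j in enumerate(activations):
        if a_j != 0:
            per_pe[owner_pe(j, num_pes)].append((j, a_j))
    return per_pe


# Hypothetical input vector of length 130 with four nonzero entries.
acts = [0.0] * 130
for j, value in [(0, 0.5), (64, 1.0), (65, 2.0), (129, 3.0)]:
    acts[j] = value
sent = nonzero_broadcasts(acts)
```

Here PE 0 broadcasts the activations with indices 0 and 64, PE 1 broadcasts indices 65 and 129, and the remaining PEs send nothing.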
B. On-chip Network Design
In EIE, the timing overhead of broadcasting the input activations to the PEs is hidden by the computation on the input activation received at each PE. However, in a general DNN accelerator, the weight matrix is not necessarily square. For example, the weight matrix of the output layer has a smaller number of rows. In that case very few output activations are mapped to each PE, and each PE takes only a few cycles to consume a received input activation. If the next input activation does not arrive in time, the PE idles, which degrades overall performance. The on-chip network of SparseNN is therefore deliberately designed so that an activation can arrive every clock cycle, keeping the datapath in the PE always busy. Here we adopt a general buffered flow control for the on-chip network. Four nonzero input activations are arbitrated at each level of the routing node. The activation with the smallest index is granted passage to the next level, while the others are stored in the buffer at the current node and wait for arbitration in the next cycle. The transmission of activations is fully pipelined, and hence each PE can receive data every cycle. The buffered flow control needs additional temporary storage in the routing node but, as shown in Section IV, this overhead is negligible.
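The smallest-index arbitration with buffering can be modeled in a few lines. This is a behavioral sketch of a single routing node under simplifying assumptions (unbounded buffers, one grant per cycle); the names `arbitrate` and `drain` are hypothetical.

```python
from collections import deque


def arbitrate(queues):
    """One cycle of a routing node.

    Among the heads of its input buffers, forward the activation with the
    smallest index; the other activations stay buffered for the next cycle.
    Each queued item is an (index, value) pair.
    """
    heads = [(q[0][0], n) for n, q in enumerate(queues) if q]
    if not heads:
        return None
    _, winner = min(heads)
    return queues[winner].popleft()


def drain(queues):
    """Run the arbiter until every buffer is empty; one item leaves per cycle."""
    granted = []
    while True:
        item = arbitrate(queues)
        if item is None:
            return granted
        granted.append(item)


# Four input ports, as in the text; one port is idle in this example.
ports = [
    deque([(0, 0.5), (8, 0.1)]),
    deque([(3, 0.2)]),
    deque(),
    deque([(1, 0.7), (2, 0.9)]),
]
order = [index for index, _ in drain(ports)]
```

When each input buffer holds activations in increasing index order, as in this example, the node emits them in globally increasing index order, one per cycle.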
C. Computation Schedule for Sparsity Predictor
In the original computation scheduling of EIE, each row of the weight matrix W is distributed to one of the 64 PEs, and the corresponding output activations are calculated. We call this row-based scheduling. However, if the number of rows of the weight matrix is smaller than the number of PEs, some of the PEs become idle under row-based scheduling. This situation occurs for the V matrix in the sparsity predictor because the rank size r is typically smaller than 64. To address this limitation, we propose a column-based scheduling of the matrix-vector multiplication. In column-based scheduling, the columns (instead of the rows) of V are mapped to the 64 PEs in an interleaved fashion. Each PE calculates a partial sum of the output activations o. The partial sums are accumulated through the 3-level H-tree, and the final output activations are stored in the root node. The accumulation operation is embedded in the 4-stage pipelined router. The utilization rate of the V computation is close to 100% even when the rank size r is as low as 16. The subsequent U computation stage of the sparsity predictor uses the original row-based scheduling, because its number of rows equals the number of output activations of W, which is usually much greater than 64.
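Column-based scheduling can be sketched as follows: each PE accumulates partial sums over its interleaved columns of V, and the partial sums are then reduced level by level. The function names are hypothetical, and the fan-in of 4 per tree level is inferred from the text (64 PEs, 3 levels, four inputs arbitrated per routing node) rather than stated as such.

```python
def tree_reduce(vectors, fan_in=4):
    """Sum equal-length vectors level by level.

    With 64 leaves and an assumed fan-in of 4, this takes 3 levels.
    """
    level = vectors
    while len(level) > 1:
        level = [
            [sum(column) for column in zip(*level[g:g + fan_in])]
            for g in range(0, len(level), fan_in)
        ]
    return level[0]


def column_based_matvec(V, a, num_pes=64):
    """Compute V @ a with the columns of V interleaved across the PEs.

    V is a list of r rows of length n; a has length n.
    """
    r = len(V)
    partial = [[0.0] * r for _ in range(num_pes)]
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue  # zero input activations are never sent to the datapath
        k = j % num_pes  # interleaved column-to-PE mapping
        for i in range(r):
            partial[k][i] += V[i][j] * a_j
    return tree_reduce(partial)


# Hypothetical small example: rank r = 3, n = 100 input activations.
V = [[(i + 1) * (j % 5) for j in range(100)] for i in range(3)]
a = [j % 3 for j in range(100)]
result = column_based_matvec(V, a)
reference = [sum(V[i][j] * a[j] for j in range(100)) for i in range(3)]
```

Because every PE owns some columns regardless of how small r is, all PEs contribute work in this model even when r is well below 64, which is the motivation given above for switching away from row-based scheduling for V.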
D. Architecture of PE with the Output Sparsity Bypass
The architecture of the PE in SparseNN is shown in Fig. 3. The datapath of the PE consists of 5 pipeline stages: memory address computation, memory access, multiplication, addition, and write back. The two physical register files are organized as a pair of ping-pong buffers, which alternately act as the source and destination register files from layer to layer. A complete computation flow of the PE goes through three matrix-vector computation phases, for V, U, and W, respectively.
V computation phase: The local nonzero input activation a_j and its associated index j are scanned from the source register file, which stores all local input activations, and pushed into the datapath. Column-based scheduling is then used to calculate the partial sum of each row. When the partial sum of one row is finished, the result is sent to the on-chip network for accumulation. The root node receives the final accumulated result of the V computation and broadcasts it back to all 64 PEs.
U computation phase: With the received V results, the row-based scheduling of the U computation is conducted in each PE. In each clock cycle, the PE processes only the head of the activation queue, and pushes the locally stored rows of the U matrix and the results of the V computation phase to the datapath. At the end of the U computation phase, the output sparsity predictor p^(l) is stored in a dedicated 1-bit register bank.
W computation phase: The local nonzero input activation a_j and its associated index j are scanned from the source register file and broadcast to all the PEs through the H-tree. After receiving a nonzero input activation and its index, each PE multiplies the received input activation with the weights of all output activations that are mapped to the PE and predicted by the sparsity predictor to be nonzero. In each cycle, the leading nonzero detector of the predictor register bank searches for the next nonzero output activation to compute, and the intermediate results are stored in the destination registers.
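The three phases can be tied together in a small functional model that also counts the multiply-accumulate operations performed in the W phase. This is a sketch under assumptions, not a model of the pipeline: the function names are hypothetical, a predictor value greater than zero is taken to mean "predicted nonzero", and all PEs are folded into a single loop.

```python
def relu(x):
    return x if x > 0 else 0.0


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def layer_with_bypass(W, U, V, a):
    """Return (output activations, 1-bit predictor, MACs used in the W phase)."""
    t = matvec(V, a)                                # V phase: r-dimensional result
    p = [1 if s > 0 else 0 for s in matvec(U, t)]   # U phase: one bit per output
    out = [0.0] * len(W)
    macs = 0
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue                                # input sparsity: never broadcast
        for i, keep in enumerate(p):
            if keep:                                # output sparsity: row is bypassed otherwise
                out[i] += W[i][j] * a_j
                macs += 1
    return [relu(o) for o in out], p, macs


# Hypothetical 3x3 layer; using U = W and V = identity makes the predictor exact.
W = [[1, -1, 2], [-1, 0, -1], [0, 2, 1]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
outputs, predictor, macs = layer_with_bypass(W, W, identity, [1, 0, 2])
```

In this example only 4 of the 9 multiply-accumulates of the dense layer are executed: one input activation is zero and one output row is predicted to be zero, and the ReLU output matches the dense computation.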
IV. EXPERIMENTAL RESULTS

A. Experimental Setup
We first compare the performance of the proposed end-to-end training algorithm with that of the conventional truncated SVD scheme on the MNIST-BASIC dataset (BASIC) along with two challenging variants [11], ROT and BG-RAND. Two different neural network architectures are explored in this work: a 3-layer network (with 1 hidden layer) and a 5-layer network (with 3 hidden layers). Each hidden layer has 1000 neurons. The nonlinear function in each layer is ReLU.
To evaluate the hardware performance, we implement SparseNN in Verilog HDL. SparseNN is synthesized using the Synopsys Design Compiler with the TSMC 65nm LP library. CACTI 6.5 [12] is used to characterize the memory model. The power consumption of SparseNN is estimated by back-annotating the toggling rate to the synthesized netlist.
B. Performance of the End-to-End Training Algorithm
The test error rate (TER) and the predicted output sparsity ρ^(l) of the 3-layer neural network are shown in Fig. 4. From Fig. 4, we can observe that the proposed end-to-end training algorithm of the sparsity predictor scales well with the rank size of the UV predictor. For instance, the TER of the truncated SVD scheme is around 1% higher than that of the end-to-end training algorithm on the ROT dataset when a small rank size is used. The performance difference arises mainly because the UV update is static in the conventional truncated SVD scheme and cannot be tuned.
In Table I, we compare the TER and the output sparsity at each hidden layer of the 5-layer neural network. The network trained by the proposed end-to-end training algorithm preserves a similar (or even better) accuracy compared with the SVD approach, but with a higher average sparsity ratio in the hidden layers. This is because the output sparsity is taken into account during end-to-end training through the ℓ1 regularization in the cost function.
Design, Automation And Test in Europe (DATE 2018)
C. Performance of the SparseNN
The design parameters of the microarchitecture of the proposed architecture, SparseNN, are listed in Table II, and the area breakdown is reported in Table III. The routing nodes occupy only a small fraction (less than 1%) of the total area, and the major area contributors are the PEs. The main reason is the large on-chip SRAMs for W, U, and V in each PE, which take around 95% of the overall area. The results on the execution cycles and the power consumption of SparseNN on the three benchmarks are shown in Fig. 5. When the UV predictor is not used, SparseNN is the same as the conventional EIE architecture, which exploits only the input activation sparsity. From Fig. 5, it can be seen that the improvement in the number of execution cycles with the output sparsity varies from layer to layer. For the 1st hidden layer, the reduction in cycles ranges from 10% to 31%. The inputs to the 1st hidden layer are the same for the UV-enabled and the UV-disabled networks, and hence the throughput improvement comes only from the output sparsity. The difference in throughput improvement across layers and benchmarks is due to the difference in predicted output sparsity. In addition, the number of nonzero output activations predicted by the sparsity predictor also varies from PE to PE. For the remaining hidden layers, the reduction in cycles can be as high as 70%. The predicted output sparsity of the previous layer increases the input sparsity of the current layer. Therefore, the throughput is jointly improved by the input sparsity and the output sparsity. The improvement in power consumption with output sparsity is almost uniform across all datasets and all hidden layers, at around 50%. The reasons for the power reduction are twofold: the number of accesses to the large W memory decreases with the output sparsity, and the access energy of the U and V memories during the sparsity prediction phase is small.
V. CONCLUSION
In this work, we first propose an end-to-end training algorithm that obtains the U and V matrices of the sparsity predictor through backpropagation. Both the scalability with rank and the predicted sparsity are better than those of the traditional truncated SVD scheme. Then, a specialized architecture, SparseNN, is developed to exploit both the input and output sparsity. Our evaluations demonstrate that with the output sparsity, the throughput of SparseNN can be improved by 10% to 70% while the power consumption is reduced by approximately half.
