5 research outputs found
LOss-Based SensiTivity rEgulaRization: Towards deep sparse neural networks
Structured Bayesian Compression for Deep Neural Networks Based on the Turbo-VBI Approach
With the growth of neural network size, model compression has attracted increasing interest in recent research. As one of the most common techniques, pruning has been studied for a long time. By exploiting the structured sparsity of the neural network, existing methods can prune neurons instead of individual weights. However, in most existing pruning methods, the surviving neurons are randomly connected in the neural network without any structure, and the non-zero weights within each neuron are also randomly distributed. Such an irregular sparse structure can cause very high control overhead and irregular memory access on hardware, and can even increase the computational complexity of the neural network. In this paper, we propose a three-layer hierarchical prior to promote a more regular sparse structure during pruning. The proposed three-layer hierarchical prior can achieve both per-neuron weight-level structured sparsity and neuron-level structured sparsity. We derive an efficient Turbo variational Bayesian inference (Turbo-VBI) algorithm to solve the resulting model compression problem with the proposed prior. The proposed Turbo-VBI algorithm has low complexity and can support more general priors than existing model compression algorithms. Simulation results show that our proposed algorithm promotes a more regular structure in the pruned neural networks while achieving even better compression rate and inference accuracy compared with the baselines.
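To make the notion of neuron-level and per-neuron weight-level structured sparsity concrete, the following is a minimal NumPy sketch (not the paper's Turbo-VBI algorithm or its hierarchical prior): whole neurons are pruned by group magnitude, and the weights of each surviving neuron are pruned in contiguous blocks, so the non-zero pattern stays regular. The function name, keep ratios, and block size are illustrative assumptions.

```python
import numpy as np

def structured_prune(W, neuron_keep_ratio=0.5, block_size=4, block_keep_ratio=0.5):
    """Illustrative neuron-level plus per-neuron block-level structured pruning.

    W: (out_features, in_features) weight matrix of a fully connected layer.
    Returns a pruned copy of W whose non-zero entries form a regular pattern.
    """
    W = W.copy()
    out_features, in_features = W.shape

    # Neuron-level structured sparsity: rank neurons (rows) by L2 norm and
    # zero out the weakest ones entirely.
    row_norms = np.linalg.norm(W, axis=1)
    n_keep = max(1, int(out_features * neuron_keep_ratio))
    pruned_rows = set(np.argsort(row_norms)[:-n_keep].tolist())
    for r in pruned_rows:
        W[r, :] = 0.0

    # Per-neuron weight-level structured sparsity: within each surviving
    # neuron, keep whole contiguous blocks of weights (ranked by block norm)
    # instead of scattered individual weights, so memory access stays regular.
    n_blocks = in_features // block_size
    for r in range(out_features):
        if r in pruned_rows or n_blocks == 0:
            continue
        blocks = W[r, :n_blocks * block_size].reshape(n_blocks, block_size).copy()
        block_norms = np.linalg.norm(blocks, axis=1)
        keep = max(1, int(n_blocks * block_keep_ratio))
        blocks[np.argsort(block_norms)[:-keep], :] = 0.0
        W[r, :n_blocks * block_size] = blocks.reshape(-1)
    return W

# Example: an 8x16 layer keeps 4 neurons, each with 2 of its 4 weight blocks.
W_pruned = structured_prune(np.random.randn(8, 16))
print((W_pruned != 0).astype(int))  # zero rows plus block-aligned zeros
```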
Harmonious Coexistence of Structured Weight Pruning and Ternarization for Deep Neural Networks
Deep convolutional neural networks (DNNs) have demonstrated phenomenal success and are widely used in many computer vision tasks. However, their enormous model size and high computing complexity prohibit wide deployment in resource-limited embedded systems such as FPGAs and mGPUs. As the two most widely adopted model compression techniques, weight pruning and quantization compress a DNN model by introducing weight sparsity (i.e., forcing a portion of the weights to zero) and by quantizing weights into limited bit-width values, respectively. Although there are works attempting to combine weight pruning and quantization, we still observe disharmony between the two, especially when more aggressive compression schemes (e.g., structured pruning and low bit-width quantization) are used. In this work, taking an FPGA as the test computing platform and Processing Elements (PEs) as the basic parallel computing units, we first propose a PE-wise structured pruning scheme, which introduces weight sparsification while taking the PE architecture into account. In addition, we integrate it with an optimized weight ternarization approach that quantizes weights into ternary values ({-1, 0, +1}), thus converting the dominant convolution operations in the DNN from multiplication-and-accumulation (MAC) to addition-only, as well as compressing the original model (from 32-bit floating point to 2-bit ternary representation) by at least 16 times. We then investigate and resolve the coexistence issue between PE-wise structured pruning and ternarization by proposing a Weight Penalty Clipping (WPC) technique with a self-adapting threshold. Our experiments show that the fusion of the proposed techniques achieves a state-of-the-art ≈21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation for ResNet-18 on the ImageNet dataset.
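As a rough illustration of the ternarization step described in this abstract (not the paper's PE-wise pruning or its WPC technique), the NumPy sketch below quantizes a weight tensor into {-1, 0, +1} with a per-layer scaling factor. The 0.7 · mean(|W|) threshold is a common heuristic from the ternary-weight-network literature, assumed here for demonstration and not taken from this paper.

```python
import numpy as np

def ternarize(W, threshold_factor=0.7):
    """Quantize a weight tensor into {-1, 0, +1} times a per-layer scale.

    The threshold 0.7 * mean(|W|) is an illustrative heuristic, not the
    self-adapting threshold proposed in the paper.
    """
    delta = threshold_factor * np.mean(np.abs(W))
    T = np.zeros_like(W)
    T[W > delta] = 1.0
    T[W < -delta] = -1.0
    mask = T != 0
    # Scale so that alpha * T approximates W over the non-zero entries.
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    return T, alpha

# With ternary weights, a dot product needs only additions and subtractions:
W = np.random.randn(4, 8)
x = np.random.randn(8)
T, alpha = ternarize(W)
y_approx = alpha * (T @ x)  # each term contributes +x_i, -x_i, or nothing
print(np.round(y_approx, 3))
```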