LBS: Loss-aware Bit Sharing for Automatic Model Compression
Low-bitwidth model compression is an effective way to reduce model
size and computational overhead. Existing compression methods rely on
compression configurations (such as pruning rates and/or bitwidths) that are
often set manually and are therefore suboptimal. Some attempts have been made
to search for these configurations automatically, but the optimization process
is often very expensive. To alleviate this, we devise a simple yet effective method named
Loss-aware Bit Sharing (LBS) to automatically search for optimal model
compression configurations. To this end, we propose a novel single-path model
that encodes all candidate compression configurations, in which a high-bitwidth
quantized value is decomposed into the sum of the lowest-bitwidth quantized
value and a series of re-assignment offsets. We then introduce learnable binary
gates to encode the choice of bitwidth, including a filter-wise 0-bit for filter
pruning. By training the binary gates jointly with the network parameters, the
compression configuration of each layer can be determined automatically.
Extensive experiments on both CIFAR-100 and ImageNet show that LBS
significantly reduces computational cost while preserving competitive
performance.
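To make the decomposition concrete, here is a minimal PyTorch sketch of the bit-sharing idea. The uniform quantizer, the sigmoid relaxation of the binary gates, and all names below are illustrative assumptions, not the paper's exact formulation (the paper also couples the gates so that selecting a higher bitwidth implies the lower ones):

```python
import torch
import torch.nn as nn

def quantize(w, bits):
    # Uniform symmetric quantizer (a simplified stand-in for the paper's quantizer);
    # in practice a straight-through estimator would carry gradients through round().
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp_min(1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class BitSharingWeight(nn.Module):
    """Single-path bit sharing: the lowest-bitwidth quantization is shared, and each
    gated residual offset upgrades the weight toward the next candidate bitwidth.
    A per-filter gate plays the role of the 0-bit (pruning) choice."""
    def __init__(self, weight, bitwidths=(2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(weight)
        self.bitwidths = sorted(bitwidths)
        # one learnable gate per bitwidth upgrade, plus one prune gate per filter
        self.step_logits = nn.Parameter(torch.zeros(len(self.bitwidths) - 1))
        self.prune_logits = nn.Parameter(torch.ones(weight.shape[0]))

    def forward(self):
        q = prev = quantize(self.weight, self.bitwidths[0])  # shared low-bit base
        for i, bits in enumerate(self.bitwidths[1:]):
            cur = quantize(self.weight, bits)
            gate = torch.sigmoid(self.step_logits[i])  # relaxed binary gate
            q = q + gate * (cur - prev)                # add the re-assignment offset
            prev = cur
        shape = (-1,) + (1,) * (self.weight.dim() - 1)
        keep = torch.sigmoid(self.prune_logits).view(shape)
        return keep * q                                # filter-wise "0-bit" pruning
```

A convolution would consume `BitSharingWeight(w)()` in place of its weight tensor; after joint training, the gates are binarized and each layer keeps only the bitwidth (or pruning) choice they select.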
APQ: Joint Search for Network Architecture, Pruning and Quantization Policy
We present APQ for efficient deep learning inference on resource-constrained
hardware. Unlike previous methods that separately search the neural
architecture, pruning policy, and quantization policy, we optimize them in a
joint manner. To deal with the larger design space this brings, a promising
approach is to train a quantization-aware accuracy predictor to quickly get the
accuracy of the quantized model and feed it to the search engine to select the
best fit. However, training this quantization-aware accuracy predictor requires
collecting a large number of quantized (model, accuracy) pairs, which involves
quantization-aware finetuning and thus is highly time-consuming. To tackle this
challenge, we propose to transfer the knowledge from a full-precision (i.e.,
fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy
predictor, which greatly improves sample efficiency. Besides, collecting
the dataset for the fp32 accuracy predictor only requires evaluating neural
networks, without any training cost, by sampling from a pretrained once-for-all
network, which is highly efficient. Extensive experiments on ImageNet
demonstrate the benefits of our joint optimization approach. At the same
accuracy, APQ reduces latency by 2x and energy by 1.3x relative to
MobileNetV2+HAQ. Compared to the separate optimization approach
(ProxylessNAS+AMC+HAQ), APQ achieves 2.3% higher ImageNet accuracy while
cutting GPU hours and CO2 emissions by orders of magnitude, pushing the
frontier of environmentally friendly green AI. The code and video are publicly
available.
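The predictor-transfer step lends itself to a short sketch: train a predictor on cheap fp32 (architecture, accuracy) pairs, then reuse its weights to warm-start a quantization-aware predictor whose input also encodes the bitwidth policy. All layer sizes, dimensions, and names below are illustrative assumptions, not APQ's released implementation:

```python
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """Tiny MLP mapping an architecture (+ optional quantization-policy)
    encoding to a predicted accuracy. Sizes are illustrative."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
    def forward(self, x):
        return self.net(x).squeeze(-1)

ARCH_DIM, QUANT_DIM = 64, 16  # hypothetical encoding sizes

# 1) fp32 predictor: cheap supervision, since accuracies come from a
#    pretrained once-for-all network without any extra training.
fp32_pred = AccuracyPredictor(ARCH_DIM)
# ... train fp32_pred on (arch_encoding, fp32_accuracy) pairs ...

# 2) transfer: reuse the fp32 predictor's weights, widening the input layer
#    to also accept the quantization-policy encoding (new columns start at 0).
int8_pred = AccuracyPredictor(ARCH_DIM + QUANT_DIM)
with torch.no_grad():
    old_first, new_first = fp32_pred.net[0], int8_pred.net[0]
    new_first.weight.zero_()
    new_first.weight[:, :ARCH_DIM] = old_first.weight  # copy arch-related weights
    new_first.bias.copy_(old_first.bias)
    for i in (2, 4):                                   # copy the deeper layers
        int8_pred.net[i].load_state_dict(fp32_pred.net[i].state_dict())

# 3) fine-tune int8_pred on a small set of quantized (model, accuracy) pairs.
```

The warm start is what buys the sample efficiency: only the new quantization-policy columns must be learned from the expensive quantization-aware data.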
Automated Pruning for Deep Neural Network Compression
In this work we present a method that improves the pruning step of the current
state-of-the-art methodology for compressing neural networks. The novelty of
the proposed pruning technique lies in its differentiability, which allows
pruning to be performed during the backpropagation phase of network training.
This enables end-to-end learning and strongly reduces training time. The
technique is based on a family of differentiable pruning functions and a new
regularizer specifically designed to enforce pruning. The experimental results
show that jointly optimizing both the thresholds and the network weights
reaches a higher compression rate, reducing the number of weights of the
pruned network by a further 14% to 33% compared to the current state of the
art. Furthermore, we believe this is the first study to analyze the
generalization capabilities, in transfer-learning tasks, of the features
extracted by a pruned network. To this end, we show that the representations
learned with the proposed pruning methodology retain the same effectiveness
and generality as those learned by the corresponding uncompressed network on a
set of different recognition tasks.
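As an illustration of what a differentiable pruning function can look like, here is a minimal sketch built around a sigmoid-based soft threshold. The paper defines its own family of pruning functions and its own regularizer, so the concrete forms and names below are assumptions:

```python
import torch
import torch.nn as nn

def soft_prune(w, t, beta=50.0):
    # Differentiable pruning: smoothly gate each weight by whether |w| exceeds
    # the learnable threshold t; as beta grows this approaches hard magnitude
    # pruning, yet gradients still reach both w and t.
    return w * torch.sigmoid(beta * (w.abs() - t))

class PrunableLinear(nn.Module):
    """Linear layer whose weights pass through the soft pruning gate, so the
    threshold is trained by backpropagation together with the weights."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_f))
        self.threshold = nn.Parameter(torch.tensor(0.01))  # learned jointly

    def forward(self, x):
        w = soft_prune(self.weight, self.threshold)
        return nn.functional.linear(x, w, self.bias)

def prune_reg(model, lam=1e-3):
    # Simple stand-in for the paper's pruning-enforcing regularizer: reward
    # larger thresholds (more pruning); the task loss counterbalances it.
    return -lam * sum(m.threshold for m in model.modules()
                      if isinstance(m, PrunableLinear))
```

The total objective would be the task loss plus `prune_reg(model)`, so each backward pass moves thresholds and weights together, which is what makes the pruning step end-to-end.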
Learning Accurate Performance Predictors for Ultrafast Automated Model Compression
In this paper, we propose an ultrafast automated model compression framework
called SeerNet for flexible network deployment. Conventional
non-differentiable methods discretely search for the desired compression
policy based on the accuracies of exhaustively trained lightweight models,
while existing differentiable methods optimize an extremely large supernet to
obtain the required compressed model for deployment. Both incur heavy
computational cost due to the complex compression-policy search and evaluation
process. In contrast, we obtain optimal efficient networks by directly
optimizing the compression policy with an accurate performance predictor,
achieving ultrafast automated model compression under various computational
cost constraints without complex compression-policy search and evaluation.
Specifically, we first train the performance predictor on the accuracies of
uncertain compression policies actively selected by an efficient evolutionary
search, so that informative supervision is provided to learn an accurate
performance predictor at acceptable cost. We then follow the gradient that
maximizes the predicted performance under a barrier complexity constraint to
acquire the desired compression policy ultrafast, employing adaptive update
step sizes with momentum to enhance the optimality of the acquired pruning and
quantization strategy. Compared with state-of-the-art automated model
compression methods, experimental results on image classification and object
detection show that our method achieves competitive accuracy-complexity
trade-offs while significantly reducing search cost.
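The second stage amounts to gradient ascent on the compression policy under a complexity constraint. A hedged sketch follows, where the log-barrier form, hyperparameters, and all names are illustrative assumptions rather than SeerNet's exact optimizer:

```python
import torch

def search_policy(predictor, complexity, budget, dim,
                  steps=200, lr=0.05, mu=0.9):
    """Ascend the predicted accuracy of a continuous compression policy
    (e.g. per-layer keep ratios / bitwidth fractions) under a log-barrier
    complexity constraint, with momentum on the update."""
    policy = torch.full((dim,), 0.5, requires_grad=True)
    velocity = torch.zeros(dim)
    for _ in range(steps):
        slack = budget - complexity(policy)   # must stay positive
        obj = predictor(policy) + 0.01 * torch.log(slack.clamp_min(1e-6))
        grad, = torch.autograd.grad(obj, policy)
        velocity = mu * velocity + lr * grad  # momentum step
        with torch.no_grad():
            policy += velocity
            policy.clamp_(0.01, 1.0)          # keep the policy feasible
    return policy.detach()
```

Here `predictor` is the trained performance predictor and `complexity` a differentiable cost model (e.g. FLOPs as a function of the policy); because no compressed network is ever trained or evaluated inside the loop, the whole search costs only a few hundred predictor evaluations.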