LBS: Loss-aware Bit Sharing for Automatic Model Compression
Low-bitwidth model compression is an effective way to reduce model
size and computational overhead. Existing compression methods rely on
compression configurations (such as pruning rates and/or bitwidths) that are
often set manually and are therefore suboptimal. Some attempts have been made
to search for these configurations automatically, but the optimization process
is often very expensive. To alleviate this, we devise a simple yet effective method named
Loss-aware Bit Sharing (LBS) to automatically search for optimal model
compression configurations. To this end, we propose a novel single-path model
that encodes all candidate compression configurations, in which a high-bitwidth
quantized value is decomposed into the sum of the lowest-bitwidth quantized
value and a series of re-assignment offsets. We then introduce learnable binary
gates to encode the choice of bitwidth, including a filter-wise 0-bit for filter
pruning. By training the binary gates jointly with the network parameters, the
compression configuration of each layer can be determined automatically.
Extensive experiments on both CIFAR-100 and ImageNet show that LBS
significantly reduces computational cost while preserving competitive
performance.
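To make the decomposition concrete, here is a minimal PyTorch sketch of the bit-sharing idea. The uniform quantizer, the sigmoid relaxation of the binary gates, and all names below are illustrative assumptions, not the paper's exact formulation (the paper also couples the gates so that selecting a higher bitwidth implies the lower ones):

```python
import torch
import torch.nn as nn

def quantize(w, bits):
    # Uniform symmetric quantizer (a simplified stand-in for the paper's quantizer);
    # in practice a straight-through estimator would carry gradients through round().
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp_min(1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class BitSharingWeight(nn.Module):
    """Single-path bit sharing: the lowest-bitwidth quantization is shared, and each
    gated residual offset upgrades the weight toward the next candidate bitwidth.
    A per-filter gate plays the role of the 0-bit (pruning) choice."""
    def __init__(self, weight, bitwidths=(2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(weight)
        self.bitwidths = sorted(bitwidths)
        # one learnable gate per bitwidth upgrade, plus one prune gate per filter
        self.step_logits = nn.Parameter(torch.zeros(len(self.bitwidths) - 1))
        self.prune_logits = nn.Parameter(torch.ones(weight.shape[0]))

    def forward(self):
        q = prev = quantize(self.weight, self.bitwidths[0])  # shared low-bit base
        for i, bits in enumerate(self.bitwidths[1:]):
            cur = quantize(self.weight, bits)
            gate = torch.sigmoid(self.step_logits[i])  # relaxed binary gate
            q = q + gate * (cur - prev)                # add the re-assignment offset
            prev = cur
        shape = (-1,) + (1,) * (self.weight.dim() - 1)
        keep = torch.sigmoid(self.prune_logits).view(shape)
        return keep * q                                # filter-wise "0-bit" pruning
```

A convolution would consume `BitSharingWeight(w)()` in place of its weight tensor; after joint training, the gates are binarized and each layer keeps only the bitwidth (or pruning) choice they select.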
APQ: Joint Search for Network Architecture, Pruning and Quantization Policy
We present APQ for efficient deep learning inference on resource-constrained
hardware. Unlike previous methods that separately search the neural
architecture, pruning policy, and quantization policy, we optimize them in a
joint manner. To deal with the larger design space this brings, a promising
approach is to train a quantization-aware accuracy predictor to quickly get the
accuracy of the quantized model and feed it to the search engine to select the
best fit. However, training this quantization-aware accuracy predictor requires
collecting a large number of quantized (model, accuracy) pairs, which involves
quantization-aware finetuning and thus is highly time-consuming. To tackle this
challenge, we propose to transfer the knowledge from a full-precision (i.e.,
fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy
predictor, which greatly improves sample efficiency. Besides, collecting
the dataset for the fp32 accuracy predictor only requires evaluating neural
networks, without any training cost, by sampling from a pretrained once-for-all
network, which is highly efficient. Extensive experiments on ImageNet
demonstrate the benefits of our joint optimization approach. At the same
accuracy, APQ reduces latency by 2x and energy by 1.3x relative to
MobileNetV2+HAQ. Compared to the separate optimization approach
(ProxylessNAS+AMC+HAQ), APQ achieves 2.3% higher ImageNet accuracy while
cutting GPU hours and CO2 emissions by orders of magnitude, pushing the
frontier of environmentally friendly green AI. The code and video are publicly
available.
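The predictor-transfer step lends itself to a short sketch: train a predictor on cheap fp32 (architecture, accuracy) pairs, then reuse its weights to warm-start a quantization-aware predictor whose input also encodes the bitwidth policy. All layer sizes, dimensions, and names below are illustrative assumptions, not APQ's released implementation:

```python
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """Tiny MLP mapping an architecture (+ optional quantization-policy)
    encoding to a predicted accuracy. Sizes are illustrative."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
    def forward(self, x):
        return self.net(x).squeeze(-1)

ARCH_DIM, QUANT_DIM = 64, 16  # hypothetical encoding sizes

# 1) fp32 predictor: cheap supervision, since accuracies come from a
#    pretrained once-for-all network without any extra training.
fp32_pred = AccuracyPredictor(ARCH_DIM)
# ... train fp32_pred on (arch_encoding, fp32_accuracy) pairs ...

# 2) transfer: reuse the fp32 predictor's weights, widening the input layer
#    to also accept the quantization-policy encoding (new columns start at 0).
int8_pred = AccuracyPredictor(ARCH_DIM + QUANT_DIM)
with torch.no_grad():
    old_first, new_first = fp32_pred.net[0], int8_pred.net[0]
    new_first.weight.zero_()
    new_first.weight[:, :ARCH_DIM] = old_first.weight  # copy arch-related weights
    new_first.bias.copy_(old_first.bias)
    for i in (2, 4):                                   # copy the deeper layers
        int8_pred.net[i].load_state_dict(fp32_pred.net[i].state_dict())

# 3) fine-tune int8_pred on a small set of quantized (model, accuracy) pairs.
```

The warm start is what buys the sample efficiency: only the new quantization-policy columns must be learned from the expensive quantization-aware data.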
Automated Pruning for Deep Neural Network Compression
In this work we present a method that improves the pruning step of the current
state-of-the-art methodology for compressing neural networks. The novelty of
the proposed pruning technique lies in its differentiability, which allows
pruning to be performed during the backpropagation phase of network training.
This enables end-to-end learning and strongly reduces training time. The
technique is based on a family of differentiable pruning functions and a new
regularizer specifically designed to enforce pruning. The experimental results
show that jointly optimizing both the thresholds and the network weights
reaches a higher compression rate, reducing the number of weights of the
pruned network by a further 14% to 33% compared to the current state of the
art. Furthermore, we believe this is the first study to analyze the
generalization capabilities, in transfer-learning tasks, of the features
extracted by a pruned network. To this end, we show that the representations
learned with the proposed pruning methodology retain the same effectiveness
and generality as those learned by the corresponding uncompressed network on a
set of different recognition tasks.
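As an illustration of what a differentiable pruning function can look like, here is a minimal sketch built around a sigmoid-based soft threshold. The paper defines its own family of pruning functions and its own regularizer, so the concrete forms and names below are assumptions:

```python
import torch
import torch.nn as nn

def soft_prune(w, t, beta=50.0):
    # Differentiable pruning: smoothly gate each weight by whether |w| exceeds
    # the learnable threshold t; as beta grows this approaches hard magnitude
    # pruning, yet gradients still reach both w and t.
    return w * torch.sigmoid(beta * (w.abs() - t))

class PrunableLinear(nn.Module):
    """Linear layer whose weights pass through the soft pruning gate, so the
    threshold is trained by backpropagation together with the weights."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_f))
        self.threshold = nn.Parameter(torch.tensor(0.01))  # learned jointly

    def forward(self, x):
        w = soft_prune(self.weight, self.threshold)
        return nn.functional.linear(x, w, self.bias)

def prune_reg(model, lam=1e-3):
    # Simple stand-in for the paper's pruning-enforcing regularizer: reward
    # larger thresholds (more pruning); the task loss counterbalances it.
    return -lam * sum(m.threshold for m in model.modules()
                      if isinstance(m, PrunableLinear))
```

The total objective would be the task loss plus `prune_reg(model)`, so each backward pass moves thresholds and weights together, which is what makes the pruning step end-to-end.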
Learning Accurate Performance Predictors for Ultrafast Automated Model Compression
In this paper, we propose an ultrafast automated model compression framework
called SeerNet for flexible network deployment. Conventional
non-differentiable methods discretely search for the desired compression
policy based on the accuracies of exhaustively trained lightweight models,
while existing differentiable methods optimize an extremely large supernet to
obtain the required compressed model for deployment. Both incur heavy
computational cost due to the complex compression-policy search and evaluation
process. In contrast, we obtain optimal efficient networks by directly
optimizing the compression policy with an accurate performance predictor,
achieving ultrafast automated model compression under various computational
cost constraints without complex compression-policy search and evaluation.
Specifically, we first train the performance predictor on the accuracies of
uncertain compression policies actively selected by an efficient evolutionary
search, so that informative supervision is provided to learn an accurate
performance predictor at acceptable cost. We then follow the gradient that
maximizes the predicted performance under a barrier complexity constraint to
acquire the desired compression policy ultrafast, employing adaptive update
step sizes with momentum to enhance the optimality of the acquired pruning and
quantization strategy. Compared with state-of-the-art automated model
compression methods, experimental results on image classification and object
detection show that our method achieves competitive accuracy-complexity
trade-offs while significantly reducing search cost.
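The second stage amounts to gradient ascent on the compression policy under a complexity constraint. A hedged sketch follows, where the log-barrier form, hyperparameters, and all names are illustrative assumptions rather than SeerNet's exact optimizer:

```python
import torch

def search_policy(predictor, complexity, budget, dim,
                  steps=200, lr=0.05, mu=0.9):
    """Ascend the predicted accuracy of a continuous compression policy
    (e.g. per-layer keep ratios / bitwidth fractions) under a log-barrier
    complexity constraint, with momentum on the update."""
    policy = torch.full((dim,), 0.5, requires_grad=True)
    velocity = torch.zeros(dim)
    for _ in range(steps):
        slack = budget - complexity(policy)   # must stay positive
        obj = predictor(policy) + 0.01 * torch.log(slack.clamp_min(1e-6))
        grad, = torch.autograd.grad(obj, policy)
        velocity = mu * velocity + lr * grad  # momentum step
        with torch.no_grad():
            policy += velocity
            policy.clamp_(0.01, 1.0)          # keep the policy feasible
    return policy.detach()
```

Here `predictor` is the trained performance predictor and `complexity` a differentiable cost model (e.g. FLOPs as a function of the policy); because no compressed network is ever trained or evaluated inside the loop, the whole search costs only a few hundred predictor evaluations.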