Rethinking the Value of Network Pruning
Network pruning is widely used for reducing the heavy inference cost of deep
models in low-resource settings. A typical pruning algorithm is a three-stage
pipeline, i.e., training (a large model), pruning and fine-tuning. During
pruning, according to a certain criterion, redundant weights are pruned and
important weights are kept to best preserve the accuracy. In this work, we make
several surprising observations which contradict common beliefs. For all
state-of-the-art structured pruning algorithms we examined, fine-tuning a
pruned model only gives comparable or worse performance than training that
model with randomly initialized weights. For pruning algorithms which assume a
predefined target network architecture, one can get rid of the full pipeline
and directly train the target network from scratch. Our observations are
consistent for multiple network architectures, datasets, and tasks, which imply
that: 1) training a large, over-parameterized model is often not necessary to
obtain an efficient final model, 2) learned "important" weights of the large
model are typically not useful for the small pruned model, 3) the pruned
architecture itself, rather than a set of inherited "important" weights, is
more crucial to the efficiency in the final model, which suggests that in some
cases pruning can be useful as an architecture search paradigm. Our results
suggest the need for more careful baseline evaluations in future research on
structured pruning methods. We also compare with the "Lottery Ticket
Hypothesis" (Frankle & Carbin 2019), and find that with optimal learning rate,
the "winning ticket" initialization as used in Frankle & Carbin (2019) does not
bring improvement over random initialization.
Comment: ICLR 2019. Significant revisions from the previous version.
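As a rough illustration of the structured-pruning baselines this abstract contrasts, the sketch below scores the filters of a single convolutional layer by L1 norm and then builds the slimmer layer either with the inherited "important" weights (prune-and-fine-tune) or with a fresh random initialization (train-from-scratch). The criterion, layer type, and helper names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

def l1_filter_scores(conv: nn.Conv2d) -> torch.Tensor:
    """Score each output filter of a conv layer by its L1 norm
    (the common 'smaller-norm-less-important' criterion)."""
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def keep_indices(conv: nn.Conv2d, keep_ratio: float) -> torch.Tensor:
    """Indices of filters kept after pruning the smallest-norm ones."""
    scores = l1_filter_scores(conv)
    n_keep = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, n_keep).indices.sort().values

# The two baselines the abstract compares:
#   (a) inherit the surviving weights and fine-tune,
#   (b) build the same slim architecture and train it from scratch.
def slim_conv(conv: nn.Conv2d, keep: torch.Tensor, inherit: bool) -> nn.Conv2d:
    new = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                    conv.stride, conv.padding, bias=conv.bias is not None)
    if inherit:  # (a) copy the "important" weights
        with torch.no_grad():
            new.weight.copy_(conv.weight[keep])
            if conv.bias is not None:
                new.bias.copy_(conv.bias[keep])
    # (b) otherwise keep the fresh random initialization
    return new

conv = nn.Conv2d(16, 32, 3, padding=1)
keep = keep_indices(conv, keep_ratio=0.5)
pruned_finetune = slim_conv(conv, keep, inherit=True)
scratch_trained = slim_conv(conv, keep, inherit=False)
```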
Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration
Previous works utilized the "smaller-norm-less-important" criterion to prune
filters with smaller norm values in a convolutional neural network. In this
paper, we analyze this norm-based criterion and point out that its
effectiveness depends on two requirements that are not always met: (1) the norm
deviation of the filters should be large; (2) the minimum norm of the filters
should be small. To solve this problem, we propose a novel filter pruning
method, namely Filter Pruning via Geometric Median (FPGM), to compress the
model regardless of those two requirements. Unlike previous methods, FPGM
compresses CNN models by pruning filters with redundancy, rather than those
with "relatively less" importance. When applied to two image classification
benchmarks, our method validates its usefulness and strengths. Notably, on
CIFAR-10, FPGM reduces more than 52% FLOPs on ResNet-110 with even 2.69%
relative accuracy improvement. Moreover, on ILSVRC-2012, FPGM reduces more than
42% FLOPs on ResNet-101 without top-5 accuracy drop, which has advanced the
state-of-the-art. Code is publicly available on GitHub:
https://github.com/he-y/filter-pruning-geometric-median
Comment: Accepted to CVPR 2019 (Oral).
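A minimal sketch of the FPGM criterion as described in the abstract: rather than dropping small-norm filters, it marks as redundant the filters closest to the geometric median of the layer's filters, approximated here by the smallest total distance to all other filters. This is not the official repository's implementation; function names and the pruning ratio are assumptions.

```python
import torch
import torch.nn as nn

def fpgm_prune_indices(conv: nn.Conv2d, prune_ratio: float) -> torch.Tensor:
    """Indices of filters to prune: those closest to the geometric median of
    the layer's filters (the most replaceable, i.e. redundant, ones),
    approximated by the smallest total distance to all other filters."""
    w = conv.weight.detach().flatten(start_dim=1)   # [out, in*k*k]
    dists = torch.cdist(w, w, p=2)                  # pairwise L2 distances
    total_dist = dists.sum(dim=1)                   # distance to all other filters
    n_prune = int(prune_ratio * w.size(0))
    return torch.topk(total_dist, n_prune, largest=False).indices

conv = nn.Conv2d(16, 64, 3, padding=1)
redundant = fpgm_prune_indices(conv, prune_ratio=0.3)   # filters to remove
```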
Stabilizing the Lottery Ticket Hypothesis
Pruning is a well-established technique for removing unnecessary structure
from neural networks after training to improve the performance of inference.
Several recent results have explored the possibility of pruning at
initialization time to provide similar benefits during training. In particular,
the "lottery ticket hypothesis" conjectures that typical neural networks
contain small subnetworks that can train to similar accuracy in a commensurate
number of steps. The evidence for this claim is that a procedure based on
iterative magnitude pruning (IMP) reliably finds such subnetworks retroactively
on small vision tasks. However, IMP fails on deeper networks, and proposed
methods to prune before training or train pruned networks encounter similar
scaling limitations. In this paper, we argue that these efforts have struggled
on deeper networks because they have focused on pruning precisely at
initialization. We modify IMP to search for subnetworks that could have been
obtained by pruning early in training (0.1% to 7% through) rather than at
iteration 0. With this change, it finds small subnetworks of deeper networks
(e.g., 80% sparsity on ResNet-50) that can complete the training process to
match the accuracy of the original network on more challenging tasks (e.g.,
ImageNet). In situations where IMP fails at iteration 0, the accuracy benefits
of delaying pruning accrue rapidly over the earliest iterations of training. To
explain these behaviors, we study subnetwork "stability," finding that - as
accuracy improves in this fashion - IMP subnetworks train to parameters closer
to those of the full network and do so with improved consistency in the face of
gradient noise. These results offer new insights into the opportunity to prune
large-scale networks early in training and the behaviors underlying the lottery
ticket hypothesis.
Comment: This article has been subsumed by "Linear Mode Connectivity and the Lottery Ticket Hypothesis" (arXiv:1912.05671, ICML 2020). Please read/cite that article instead.
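A hedged sketch of iterative magnitude pruning (IMP) with the rewind point moved from iteration 0 to an early training step k, which is the modification the abstract describes. The `train_steps` routine, the per-tensor pruning quantile, and the loop structure are assumptions for illustration, not the authors' exact procedure.

```python
import copy
import torch
import torch.nn as nn

def imp_with_rewinding(model: nn.Module, train_steps, k: int,
                       total_steps: int, rounds: int, prune_frac: float):
    """Iterative magnitude pruning with weights rewound to step k instead of
    iteration 0. `train_steps(model, n)` is a placeholder training routine
    assumed to train the model in place for n steps."""
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
             if p.dim() > 1}                          # only prune weight tensors
    train_steps(model, k)                             # brief early training
    rewind = copy.deepcopy(model.state_dict())        # theta_k, the rewind point

    for _ in range(rounds):
        train_steps(model, total_steps - k)           # train to completion
        for name, p in model.named_parameters():      # prune the smallest
            if name in masks:                         # surviving weights
                survivors = p.detach().abs()[masks[name].bool()]
                cutoff = torch.quantile(survivors, prune_frac)
                masks[name] *= (p.detach().abs() >= cutoff).float()
        model.load_state_dict(rewind)                 # rewind to step k
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p *= masks[name]                  # re-apply the mask
    return model, masks
```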
A Closer Look at Structured Pruning for Neural Network Compression
Structured pruning is a popular method for compressing a neural network:
given a large trained network, one alternates between removing channel connections and fine-tuning, thereby reducing the overall width of the network.
However, the efficacy of structured pruning has largely evaded scrutiny. In
this paper, we examine ResNets and DenseNets obtained through structured
pruning-and-tuning and make two interesting observations: (i) reduced
networks---smaller versions of the original network trained from
scratch---consistently outperform pruned networks; (ii) if one takes the architecture of a pruned network and then trains it from scratch, it is significantly more competitive. Furthermore, these architectures are easy to
approximate: we can prune once and obtain a family of new, scalable network
architectures that can simply be trained from scratch. Finally, we compare the
inference speed of reduced and pruned networks on hardware, and show that
reduced networks are significantly faster. Code is available at
https://github.com/BayesWatch/pytorch-prunes.
Comment: Preprint. First two authors contributed equally. Paper title has changed.
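To make the "reduced" versus "pruned-architecture" comparison concrete, here is a toy sketch: one network uniformly shrinks every layer of the original, the other copies the per-layer channel counts left behind by structured pruning and is trained from scratch. The plain VGG-style stack is an illustrative assumption; the paper studies ResNets and DenseNets.

```python
import torch.nn as nn

def reduced_widths(base_widths, factor: float):
    """'Reduced' network: uniformly shrink every layer of the original."""
    return [max(1, int(w * factor)) for w in base_widths]

def pruned_widths(pruned_model: nn.Module):
    """'Pruned-architecture' network: copy the per-layer channel counts that
    structured pruning produced, then train that architecture from scratch."""
    return [m.out_channels for m in pruned_model.modules()
            if isinstance(m, nn.Conv2d)]

def build_convnet(widths, in_channels=3, num_classes=10):
    layers, c = [], in_channels
    for w in widths:
        layers += [nn.Conv2d(c, w, 3, padding=1), nn.BatchNorm2d(w), nn.ReLU()]
        c = w
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(c, num_classes))

scratch_reduced = build_convnet(reduced_widths([64, 128, 256], 0.5))
```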
Robust Sparse Regularization: Simultaneously Optimizing Neural Network Robustness and Compactness
A Deep Neural Network (DNN) trained by gradient descent is known to be vulnerable to maliciously perturbed adversarial inputs, i.e., adversarial attacks. As a countermeasure against adversarial attacks, increasing the model capacity to enhance DNN robustness has been discussed and reported as an effective approach by many recent works. In this work, we show that
shrinking the model size through proper weight pruning can even be helpful to
improve the DNN robustness under adversarial attack. For obtaining a
simultaneously robust and compact DNN model, we propose a multi-objective
training method called Robust Sparse Regularization (RSR), through the fusion
of various regularization techniques, including channel-wise noise injection,
lasso weight penalty, and adversarial training. We conduct extensive
experiments across popular ResNet-20, ResNet-18 and VGG-16 DNN architectures to
demonstrate the effectiveness of RSR against popular white-box (i.e., PGD and
FGSM) and black-box attacks. Thanks to RSR, 85% of the weight connections of ResNet-18 can be pruned while still achieving 0.68% and 8.72% improvements in clean- and perturbed-data accuracy, respectively, on the CIFAR-10 dataset, in comparison to its PGD adversarial training baseline.
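A rough sketch of how the three ingredients named in the abstract (channel-wise noise injection, a lasso weight penalty, and adversarial training) could be combined into one training objective. The PGD settings, noise parameterization, and module placement are assumptions, not the authors' exact RSR formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyChannel(nn.Module):
    """Channel-wise Gaussian noise injection (one learnable scale per channel),
    a hedged stand-in for the regularizer described in the abstract; it would
    be inserted after convolutional layers of the model."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1) * 0.1)
    def forward(self, x):
        if self.training:
            return x + self.alpha * torch.randn_like(x)
        return x

def rsr_loss(model, x, y, eps=8 / 255, steps=7, lasso=1e-4):
    """Adversarial training loss (PGD inner maximization) plus a lasso (L1)
    penalty on the weights, which drives many connections toward zero."""
    x_adv = x.clone().detach() + eps * torch.empty_like(x).uniform_(-1, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + (eps / steps) * grad.sign()
        x_adv = x.detach() + (x_adv - x.detach()).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return F.cross_entropy(model(x_adv), y) + lasso * l1
```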
Distilling with Performance Enhanced Students
The task of accelerating large neural networks on general purpose hardware
has, in recent years, prompted the use of channel pruning to reduce network
size. However, the efficacy of pruning based approaches has since been called
into question. In this paper, we turn to distillation for model
compression---specifically, attention transfer---and develop a simple method
for discovering performance enhanced student networks. We combine channel
saliency metrics with empirical observations of runtime performance to design
more accurate networks for a given latency budget. We apply our methodology to
residual and densely-connected networks, and show that we are able to find
resource-efficient student networks on different hardware platforms while
maintaining very high accuracy. These performance-enhanced student networks
achieve up to 10% boosts in top-1 ImageNet accuracy over their channel-pruned
counterparts for the same inference time.
Comment: Preprint. Paper title has changed.
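Since the abstract names attention transfer as the distillation mechanism, here is a minimal sketch of that loss term: spatial attention maps are formed by summing squared activations over channels and matching them between student and teacher layers. The weighting `beta` and the layer pairing are assumptions; the latency-aware student search is not shown.

```python
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Spatial attention map of a feature tensor [N, C, H, W]: sum of squared
    activations over channels, flattened and L2-normalized per sample."""
    a = feat.pow(2).sum(dim=1).flatten(start_dim=1)    # [N, H*W]
    return F.normalize(a, dim=1)

def attention_transfer_loss(student_feats, teacher_feats, beta=1e3):
    """Attention-transfer distillation term, summed over matched layer pairs."""
    return beta * sum((attention_map(s) - attention_map(t)).pow(2).mean()
                      for s, t in zip(student_feats, teacher_feats))

# Total objective (sketch): task loss on labels plus the attention transfer term,
# where student_feats come from the student and teacher_feats from the frozen teacher:
# loss = F.cross_entropy(logits, targets) + attention_transfer_loss(student_feats, teacher_feats)
```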
Functionality-Oriented Convolutional Filter Pruning
The sophisticated structure of a Convolutional Neural Network (CNN) allows for outstanding performance, but at the cost of intensive computation. As significant redundancies are inevitably present in such a structure, many works have proposed pruning the convolutional filters to reduce computation cost. Although extremely effective, most works are based only on quantitative characteristics of the convolutional filters, and largely overlook the qualitative interpretation of each individual filter's specific functionality.
In this work, we interpreted the functionality and redundancy of the convolutional filters from different perspectives, and proposed a functionality-oriented filter pruning method. With extensive experimental results, we demonstrated the qualitative significance of convolutional filters regardless of their magnitude, showed significant neural network redundancy due to repetitive filter functions, and analyzed how filter functionality degrades under an inappropriate retraining process. Such an interpretable pruning approach not only offers outstanding computation cost optimization over previous filter pruning methods, but also makes the filter pruning process itself interpretable.
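The abstract does not spell out how filter functionality is measured, so the sketch below uses cosine similarity of filter weights as a crude proxy for "repetitive filter functions": near-duplicate filters are grouped greedily and one representative per group is kept. This is an illustrative stand-in, not the paper's functionality analysis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def redundant_filter_groups(conv: nn.Conv2d, sim_threshold: float = 0.9):
    """Greedily group filters with near-duplicate function, using cosine
    similarity of flattened filter weights as a crude functional proxy."""
    w = F.normalize(conv.weight.detach().flatten(1), dim=1)
    sim = w @ w.t()                                  # [out, out] cosine similarities
    groups, assigned = [], set()
    for i in range(sim.size(0)):
        if i in assigned:
            continue
        members = [j for j in range(sim.size(0))
                   if j not in assigned and sim[i, j] >= sim_threshold]
        assigned.update(members)
        groups.append(members)
    return groups

conv = nn.Conv2d(3, 64, 3)
keep = [g[0] for g in redundant_filter_groups(conv)]   # one representative per group
```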
Really should we pruning after model be totally trained? Pruning based on a small amount of training
Pre-training of models plays an important role in the decision-making of pruning algorithms. We find that excessive pre-training is not necessary for pruning. Based on this idea, we propose a pruning algorithm, Incremental Pruning based on Less Training (IPLT). Compared with traditional pruning algorithms that rely on a large amount of pre-training, IPLT achieves a competitive compression effect under the same simple pruning strategy. While preserving accuracy, IPLT can achieve 8x-9x compression for VGG-19 on CIFAR-10 and only needs to pre-train for a few epochs. For VGG-19 on CIFAR-10, we achieve not only about 10x test-time acceleration but also about 10x training acceleration. Existing research mainly focuses on compression and acceleration at the deployment stage of a model, while compression and acceleration during the training stage have received little attention. We propose a pruning algorithm that can compress and accelerate during the training stage; it is novel to consider the amount of pre-training required by a pruning algorithm. Our results imply that extensive pre-training may not be necessary for pruning algorithms.
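A hedged sketch of the general idea of pruning incrementally during the first few training epochs instead of after full pre-training. The schedule, the magnitude criterion, and the `train_one_epoch` placeholder are assumptions; the abstract does not specify IPLT's exact strategy.

```python
import torch
import torch.nn as nn

def incremental_early_pruning(model: nn.Module, train_one_epoch,
                              pre_epochs: int = 3, step: float = 0.1,
                              target_sparsity: float = 0.9):
    """Prune incrementally during the first few epochs instead of after full
    training: after each early epoch, remove the next `step` fraction of the
    smallest-magnitude weights until `target_sparsity` is reached.
    `train_one_epoch(model)` is a placeholder training routine."""
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
             if p.dim() > 1}
    sparsity = 0.0
    for _ in range(pre_epochs):
        train_one_epoch(model)
        sparsity = min(target_sparsity, sparsity + step)
        all_w = torch.cat([p.detach().abs().flatten()
                           for n, p in model.named_parameters() if n in masks])
        cutoff = torch.quantile(all_w, sparsity)
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in masks:
                    masks[n] = (p.detach().abs() >= cutoff).float()
                    p *= masks[n]
    return model, masks
```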
Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers
Model pruning has become a useful technique that improves the computational
efficiency of deep learning, making it possible to deploy solutions in
resource-limited scenarios. A widely-used practice in relevant work assumes
that a smaller-norm parameter or feature plays a less informative role at the
inference time. In this paper, we propose a channel pruning technique for
accelerating the computations of deep convolutional neural networks (CNNs) that
does not critically rely on this assumption. Instead, it focuses on direct
simplification of the channel-to-channel computation graph of a CNN, without the need to perform the computationally difficult and not-always-useful task of making the high-dimensional tensors of a CNN structured-sparse. Our approach takes two stages: first, to adopt an end-to-end stochastic training method that
eventually forces the outputs of some channels to be constant, and then to
prune those constant channels from the original neural network by adjusting the
biases of their impacting layers such that the resulting compact model can be
quickly fine-tuned. Our approach is mathematically appealing from an
optimization perspective and easy to reproduce. We evaluate our approach on several image learning benchmarks and demonstrate its interesting aspects and competitive performance.
Comment: Accepted to ICLR 2018, 11 pages.
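The abstract describes forcing some channel outputs to become constant during end-to-end training and then folding those constants into the biases of downstream layers. Below is a minimal sketch under the assumption that the sparsifying update is an ISTA-style soft threshold on batch-norm scales and that the batch norm feeds the next convolution directly; padding/border effects are ignored.

```python
import torch
import torch.nn as nn

def ista_step_on_bn(bn: nn.BatchNorm2d, lr: float, lam: float):
    """After the usual gradient step, apply an ISTA-style soft threshold to the
    BN scale gamma, pushing some channels toward a constant (gamma == 0) output."""
    with torch.no_grad():
        g = bn.weight
        g.copy_(torch.sign(g) * torch.clamp(g.abs() - lr * lam, min=0.0))

def fold_constant_channels(bn: nn.BatchNorm2d, next_conv: nn.Conv2d):
    """Channels with gamma == 0 output the constant beta everywhere. Fold that
    constant into the next conv's bias so the channel can be removed with
    little change to the network's function (border effects ignored)."""
    dead = bn.weight.detach() == 0                          # constant channels
    if dead.any() and next_conv.bias is not None:
        with torch.no_grad():
            const = bn.bias.detach() * dead.float()         # per-channel constant output
            k_sum = next_conv.weight.detach().sum(dim=(2, 3))  # [out, in] kernel sums
            next_conv.bias += k_sum @ const                 # absorb constant contribution
    return dead                                             # mask of channels safe to prune
```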
Channel Pruning via Optimal Thresholding
Structured pruning, especially channel pruning, is widely used for its reduced computational cost and its compatibility with off-the-shelf hardware devices. Among existing works, weights are typically removed using a predefined global threshold, or a threshold computed from a predefined metric. Designs based on a predefined global threshold ignore the variation among different layers and weight distributions; therefore, they may often result in sub-optimal performance caused by over-pruning or under-pruning. In this paper, we present
a simple yet effective method, termed Optimal Thresholding (OT), to prune
channels with layer-dependent thresholds that optimally separate important from
negligible channels. By using OT, most negligible or unimportant channels are
pruned to achieve high sparsity while minimizing performance degradation. Since
most important weights are preserved, the pruned model can be further
fine-tuned and quickly converge with very few iterations. Our method
demonstrates superior performance, especially when compared to the
state-of-the-art designs at high levels of sparsity. On CIFAR-100, a DenseNet-121 pruned and fine-tuned using OT achieves 75.99% accuracy with only 1.46e8 FLOPs and 0.71M parameters.
Comment: ICONIP 2020.
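The abstract motivates layer-dependent thresholds without giving the optimality criterion, so the sketch below uses a simple stand-in: per layer, sort the channel scaling factors and cut at the largest gap, rather than applying one global threshold. The largest-gap rule and the use of batch-norm scales as channel scores are assumptions, not the paper's OT procedure.

```python
import torch
import torch.nn as nn

def layer_threshold(bn: nn.BatchNorm2d, min_keep: int = 1) -> float:
    """A layer-dependent threshold: sort the channel scaling factors and place
    the cut at the largest gap between consecutive values, so important and
    negligible channels are separated per layer rather than globally."""
    g = bn.weight.detach().abs().sort().values        # ascending channel scores
    gaps = g[1:] - g[:-1]
    cut = int(torch.argmax(gaps[: g.numel() - min_keep]))
    return float((g[cut] + g[cut + 1]) / 2)

def channels_to_prune(bn: nn.BatchNorm2d) -> torch.Tensor:
    thr = layer_threshold(bn)
    return (bn.weight.detach().abs() < thr).nonzero(as_tuple=True)[0]

bn = nn.BatchNorm2d(64)
prune_idx = channels_to_prune(bn)    # per-layer decision, not one global cut
```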