14 research outputs found
PDP: Parameter-free Differentiable Pruning is All You Need
DNN pruning is a popular way to reduce the size of a model, improve the
inference latency, and minimize the power consumption on DNN accelerators.
However, existing approaches can be too complex, expensive, or ineffective to
apply across a variety of vision/language tasks and DNN architectures, or to honor
structured pruning constraints. In this paper, we propose an efficient yet
effective train-time pruning scheme, Parameter-free Differentiable Pruning
(PDP), which offers state-of-the-art qualities in model size, accuracy, and
training cost. PDP uses a dynamic function of weights during training to
generate soft pruning masks for the weights in a parameter-free manner for a
given pruning target. While fully differentiable, PDP is simple and efficient
enough to be applied universally, delivering state-of-the-art
random/structured/channel pruning results on various vision and natural
language tasks. For example, for MobileNet-v1, PDP can achieve 68.2% top-1
ImageNet1k accuracy at 86.6% sparsity, 1.7% higher accuracy than the
state-of-the-art algorithms achieve. Also, PDP yields over 83.1% accuracy on
Multi-Genre Natural Language Inference with 90% sparsity for BERT, while the
next best from the existing techniques shows 81.5% accuracy. In addition, PDP
can be applied to structured pruning, such as N:M pruning and channel pruning.
For 1:4 structured pruning of ResNet18, PDP improved the top-1 ImageNet1k
accuracy by over 3.6% over the state-of-the-art. For channel pruning of
ResNet50, PDP's top-1 ImageNet1k accuracy is only 0.6% below the
state-of-the-art.
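The abstract does not spell out the masking function, but a minimal sketch of the general idea, a parameter-free soft mask computed from the current weight magnitudes and a target sparsity, might look like the following (the quantile threshold, sigmoid relaxation, and temperature are assumptions for illustration, not the paper's exact formulation):

```python
import torch

def soft_prune_mask(weight: torch.Tensor, sparsity: float, temperature: float = 1e-3):
    """Differentiable, parameter-free soft mask from current weight magnitudes.

    Illustrative sketch only: the threshold is the magnitude at the target
    sparsity quantile, and a sigmoid turns the distance to that threshold
    into a soft 0/1 mask.
    """
    magnitude = weight.abs()
    threshold = torch.quantile(magnitude.flatten(), sparsity)    # dynamic function of the weights
    mask = torch.sigmoid((magnitude - threshold) / temperature)  # soft mask in (0, 1)
    return mask

# During training, the forward pass uses the masked weights, so gradients
# flow through both the weights and the mask:
w = torch.randn(256, 512, requires_grad=True)
masked_w = w * soft_prune_mask(w, sparsity=0.866)
```

Because the mask is a differentiable function of the weights themselves, no extra mask parameters need to be learned, which is what makes such a scheme parameter-free.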
Neural Network Pruning by Gradient Descent
The rapid increase in the parameters of deep learning models has led to
significant costs, challenging computational efficiency and model
interpretability. In this paper, we introduce a novel and straightforward
neural network pruning framework that incorporates the Gumbel-Softmax
technique. This framework enables the simultaneous optimization of a network's
weights and topology in an end-to-end process using stochastic gradient
descent. Empirical results demonstrate its exceptional compression capability,
maintaining high accuracy on the MNIST dataset with only 0.15% of the original
network parameters. Moreover, our framework enhances neural network
interpretability, not only by allowing easy extraction of feature importance
directly from the pruned network but also by enabling visualization of feature
symmetry and the pathways of information propagation from features to outcomes.
Although the pruning strategy is learned through deep learning, it is
surprisingly intuitive and understandable, focusing on selecting key
representative features and exploiting data patterns to achieve extremely sparse
pruning. We believe our method opens a promising new avenue for deep learning
pruning and the creation of interpretable machine learning systems.
Comment: 21 pages, 5 figures
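As a rough illustration of how a Gumbel-Softmax relaxation lets a network's weights and topology be optimized jointly by SGD, consider the sketch below; the per-weight keep/drop logits, temperature, and sparsity penalty are assumptions made for the example, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

class GumbelMaskedLinear(torch.nn.Module):
    """Linear layer whose per-weight keep/drop gates are relaxed Bernoulli
    variables sampled with Gumbel-Softmax, so weights and topology are
    trained together end to end. Illustrative sketch only."""

    def __init__(self, in_features, out_features, tau=1.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # Two logits per weight: [drop, keep].
        self.gate_logits = torch.nn.Parameter(torch.zeros(out_features, in_features, 2))
        self.tau = tau

    def forward(self, x):
        # Relaxed one-hot sample; [..., 1] is the soft probability of keeping each weight.
        gates = F.gumbel_softmax(self.gate_logits, tau=self.tau, hard=False)[..., 1]
        return F.linear(x, self.weight * gates)

    def sparsity_penalty(self):
        # Encourage dropping weights by penalizing the expected keep probability.
        return F.softmax(self.gate_logits, dim=-1)[..., 1].mean()

layer = GumbelMaskedLinear(784, 128)
loss = layer(torch.randn(32, 784)).pow(2).mean() + 1e-2 * layer.sparsity_penalty()
loss.backward()
```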
Growing Efficient Deep Networks by Structured Continuous Sparsification
We develop an approach to training deep networks while dynamically adjusting
their architecture, driven by a principled combination of accuracy and sparsity
objectives. Unlike conventional pruning approaches, our method adopts a gradual
continuous relaxation of discrete network structure optimization and then
samples sparse subnetworks, enabling efficient deep networks to be trained in a
growing and pruning manner. Extensive experiments across CIFAR-10, ImageNet,
PASCAL VOC, and Penn Treebank, with convolutional models for image
classification and semantic segmentation, and recurrent models for language
modeling, show that our training scheme yields efficient networks that are
smaller and more accurate than those produced by competing pruning methods.
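One hedged way to picture the gradual continuous relaxation is a per-channel gate whose temperature is annealed during training, so that an initially soft architecture choice hardens into a discrete grow/prune decision from which sparse subnetworks can be sampled. The gate form and annealing schedule below are illustrative assumptions, not the paper's exact scheme.

```python
import torch

class ChannelGate(torch.nn.Module):
    """Per-channel gate sigmoid(beta * score). A small beta keeps the gate soft
    and trainable; annealing beta upward gradually sharpens it toward a hard
    keep/drop decision per channel. Illustrative sketch only."""

    def __init__(self, num_channels):
        super().__init__()
        self.score = torch.nn.Parameter(torch.zeros(num_channels))
        self.beta = 1.0  # temperature, annealed upward during training

    def forward(self, x):                      # x: (N, C, H, W)
        gate = torch.sigmoid(self.beta * self.score)
        return x * gate.view(1, -1, 1, 1)

    def sample_subnetwork(self):
        # Round the relaxed gates to a discrete set of kept channels.
        return (torch.sigmoid(self.beta * self.score) > 0.5).nonzero().flatten()

gate = ChannelGate(64)
for epoch in range(90):
    gate.beta = 1.0 + epoch * 0.5   # relaxation -> (near-)discrete structure
    # ... usual training step on the gated network, plus a sparsity objective ...
```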
Emerging Paradigms of Neural Network Pruning
Over-parameterization of neural networks benefits optimization and
generalization yet incurs significant cost in practice. Pruning is adopted as a
post-processing solution to this problem, which aims to remove unnecessary
parameters in a neural network with little compromise in performance. It has been
broadly believed that the resulting sparse neural network cannot be trained from
scratch to comparable accuracy. However, several recent works (e.g., [Frankle
and Carbin, 2019a]) challenge this belief by discovering random sparse networks
that can be trained to match the performance of their dense counterparts.
This new pruning paradigm has since inspired further methods of pruning at
initialization. Despite this encouraging progress, how to coordinate these
new pruning approaches with traditional pruning has not yet been explored.
This survey seeks to bridge the gap by proposing a general pruning framework in
which the emerging pruning paradigms can be accommodated alongside the
traditional one. With it, we systematically reflect on the major differences and
new insights brought by these new pruning approaches, with representative works
discussed at length. Finally, we summarize the open questions as worthy future
directions.
SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization
Neural network pruning is a key technique towards engineering large yet
scalable, interpretable, and generalizable models. Prior work on the subject
has developed largely along two orthogonal directions: (1) differentiable
pruning for efficiently and accurately scoring the importance of parameters,
and (2) combinatorial optimization for efficiently searching over the space of
sparse models. We unite the two approaches, both theoretically and empirically,
to produce a coherent framework for structured neural network pruning in which
differentiable pruning guides combinatorial optimization algorithms to select
the most important sparse set of parameters. Theoretically, we show how many
existing differentiable pruning techniques can be understood as nonconvex
regularization for group sparse optimization, and prove that for a wide class
of nonconvex regularizers, the global optimum is unique, group-sparse, and
provably yields an approximate solution to a sparse convex optimization
problem. The resulting algorithm that we propose, SequentialAttention++,
advances the state of the art in large-scale neural network block-wise pruning
tasks on the ImageNet and Criteo datasets.
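As a loose illustration of "differentiable pruning as nonconvex regularization for group sparse optimization," the sketch below adds a nonconvex penalty on the norms of contiguous weight blocks; the block grouping and the log-based penalty are assumptions standing in for the class of regularizers the paper analyzes, not its exact choice.

```python
import torch

def group_sparse_penalty(weight: torch.Tensor, block_size: int = 4, eps: float = 1e-2):
    """Nonconvex group-sparsity regularizer over contiguous weight blocks.

    Illustrative only: blocks are formed along the input dimension, and
    log(1 + ||block|| / eps) is one example of a nonconvex penalty that
    drives entire blocks toward zero.
    """
    out_f, in_f = weight.shape
    blocks = weight.view(out_f, in_f // block_size, block_size)
    group_norms = blocks.norm(dim=-1)            # one norm per block
    return torch.log1p(group_norms / eps).sum()

w = torch.randn(128, 256, requires_grad=True)
reg = group_sparse_penalty(w)                    # added to the task loss during training
reg.backward()
```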
The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models
The computer vision world has been regaining enthusiasm for various
pre-trained models, including both classical ImageNet supervised pre-training
and recently emerged self-supervised pre-training such as simCLR and MoCo.
Pre-trained weights often boost a wide range of downstream tasks including
classification, detection, and segmentation. Latest studies suggest that
pre-training benefits from gigantic model capacity. This leads us to
ask: after pre-training, does a pre-trained model indeed have to stay large to
retain its downstream transferability?
In this paper, we examine supervised and self-supervised pre-trained models
through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly
sparse matching subnetworks that can be trained in isolation from (nearly)
scratch yet still reach the full models' performance. We extend the scope of
LTH and ask whether matching subnetworks that enjoy the same downstream
transfer performance still exist in pre-trained computer vision models.
Our extensive experiments convey an overall positive message: from all
pre-trained weights obtained by ImageNet classification, simCLR, and MoCo, we
are consistently able to locate such matching subnetworks at 59.04% to 96.48%
sparsity that transfer universally to multiple downstream tasks, whose
performance sees no degradation compared to using the full pre-trained weights.
Further analyses reveal that subnetworks found from different pre-training schemes tend
to yield diverse mask structures and perturbation sensitivities. We conclude
that the core LTH observations remain generally relevant in the pre-training
paradigm of computer vision, but more delicate discussions are needed in some
cases. Codes and pre-trained models will be made available at:
https://github.com/VITA-Group/CV_LTH_Pre-training.
Comment: CVPR 202
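For readers unfamiliar with how matching subnetworks are typically located, the sketch below outlines iterative magnitude pruning with rewinding to the pre-trained weights; the round count, prune rate, and the helper train_fn are hypothetical stand-ins for the paper's experimental details.

```python
import copy
import torch

def find_matching_subnetwork(model, train_fn, rounds=5, prune_rate=0.2):
    """Iterative magnitude pruning (IMP) sketch: repeatedly train, prune the
    smallest surviving weights globally, and rewind the remaining weights to
    the pre-trained initialization. Round count and prune rate are assumptions."""
    pretrained = copy.deepcopy(model.state_dict())            # rewind point
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train_fn(model, masks)                                # train with masks applied
        # Globally rank surviving weights by magnitude and drop the lowest prune_rate.
        scores = torch.cat([(p.abs() * masks[n]).flatten()
                            for n, p in model.named_parameters()])
        k = int(prune_rate * int(sum(m.sum() for m in masks.values())))
        threshold = torch.kthvalue(scores[scores > 0], k).values
        for n, p in model.named_parameters():
            masks[n] = masks[n] * (p.abs() > threshold).float()
        model.load_state_dict(pretrained)                     # rewind to pre-trained weights
    return masks
```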