Rethinking the Value of Network Pruning
Network pruning is widely used for reducing the heavy inference cost of deep
models in low-resource settings. A typical pruning algorithm is a three-stage
pipeline, i.e., training (a large model), pruning and fine-tuning. During
pruning, according to a certain criterion, redundant weights are pruned and
important weights are kept to best preserve the accuracy. In this work, we make
several surprising observations which contradict common beliefs. For all
state-of-the-art structured pruning algorithms we examined, fine-tuning a
pruned model only gives comparable or worse performance than training that
model with randomly initialized weights. For pruning algorithms which assume a
predefined target network architecture, one can get rid of the full pipeline
and directly train the target network from scratch. Our observations are
consistent for multiple network architectures, datasets, and tasks, which imply
that: 1) training a large, over-parameterized model is often not necessary to
obtain an efficient final model, 2) learned "important" weights of the large
model are typically not useful for the small pruned model, 3) the pruned
architecture itself, rather than a set of inherited "important" weights, is
more crucial to the efficiency in the final model, which suggests that in some
cases pruning can be useful as an architecture search paradigm. Our results
suggest the need for more careful baseline evaluations in future research on
structured pruning methods. We also compare with the "Lottery Ticket
Hypothesis" (Frankle & Carbin 2019), and find that with optimal learning rate,
the "winning ticket" initialization as used in Frankle & Carbin (2019) does not
bring improvement over random initialization.
Comment: ICLR 2019. Significant revisions from the previous version.
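The contrast the paper draws can be made concrete with a small sketch. The snippet below is not the authors' code; it uses an L1-norm filter criterion and toy layer sizes as illustrative assumptions, and only contrasts "inherit the pruned weights, then fine-tune" with "keep only the pruned architecture and train it from scratch with random initialization".

```python
# A minimal sketch (not the paper's code) contrasting "inherit weights then
# fine-tune" with "re-initialize the pruned architecture and train from scratch".
# The L1-norm filter criterion and the toy layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

def prune_conv_by_l1(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Keep the filters with the largest L1 norms; return a smaller Conv2d."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per filter
    keep = torch.topk(scores, n_keep).indices
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])               # inherit "important" weights
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned

large = nn.Conv2d(16, 64, 3, padding=1)

# Pipeline A: prune, inherit weights, then fine-tune (short schedule).
pruned_inherited = prune_conv_by_l1(large, keep_ratio=0.5)

# Pipeline B (the paper's baseline): keep only the *architecture* and
# train it from scratch with random init and the full schedule.
pruned_scratch = nn.Conv2d(16, pruned_inherited.out_channels, 3, padding=1)

x = torch.randn(2, 16, 8, 8)
print(pruned_inherited(x).shape, pruned_scratch(x).shape)  # same pruned architecture
```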
JavaScript Convolutional Neural Networks for Keyword Spotting in the Browser: An Experimental Analysis
Used for simple command recognition on devices from smart routers to mobile
phones, keyword spotting systems are everywhere. Ubiquitous as well are web
applications, which have grown in popularity and complexity over the last
decade with significant improvements in usability under cross-platform
conditions. However, despite their obvious advantage in natural language
interaction, voice-enabled web applications are still few and far between. In
this work, we attempt to bridge this gap by bringing keyword spotting
capabilities directly into the browser. To our knowledge, we are the first to
demonstrate a fully-functional implementation of convolutional neural networks
in pure JavaScript that runs in any standards-compliant browser. We also apply
network slimming, a model compression technique, to explore the
accuracy-efficiency tradeoffs, reporting latency measurements on a range of
devices and software. Overall, our robust, cross-device implementation for
keyword spotting realizes a new paradigm for serving neural network
applications, and one of our slim models reduces latency by 66% with a minimal
decrease in accuracy of 4% (from 94% to 90%).
Comment: 5 pages, 3 figures.
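The network-slimming technique referenced above ranks channels by their batch-norm scale factors, learned under an L1 penalty, and prunes those below a global threshold. The sketch below illustrates that idea only; the threshold, layer sizes, and random gamma initialization are illustrative assumptions rather than the paper's settings.

```python
# A minimal sketch of the network-slimming idea (Liu et al.): channels are ranked
# by batch-norm scale factors gamma (trained with an L1 penalty) and those below a
# global threshold are masked out. Settings here are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
)

def l1_penalty_on_bn(model: nn.Module) -> torch.Tensor:
    """Sparsity term added to the training loss so many gammas shrink toward 0."""
    return sum(m.weight.abs().sum() for m in model.modules()
               if isinstance(m, nn.BatchNorm2d))

def slim_masks(model: nn.Module, prune_fraction: float = 0.5):
    """Global threshold over all gammas; returns a keep-mask per BN layer."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_fraction)
    return [m.weight.detach().abs() > threshold
            for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

# simulate gammas after sparsity-regularized training (normally learned, not set)
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        nn.init.uniform_(m.weight, 0.0, 1.0)

# loss = task_loss + 1e-4 * l1_penalty_on_bn(model)   # during training (illustrative)
masks = slim_masks(model)
print([int(m.sum()) for m in masks])  # surviving channels per BN layer
```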
DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices
Deploying deep neural networks on mobile devices is a challenging task.
Current model compression methods such as matrix decomposition effectively
reduce the deployed model size, but still cannot satisfy real-time processing
requirements. This paper first identifies the major obstacle as the excessive
execution time of non-tensor layers, such as pooling and normalization, which
have no tensor-like trainable parameters. This motivates us to design a novel
acceleration framework, DeepRebirth, which "slims" existing consecutive and
parallel non-tensor and tensor layers. Layer slimming is applied to two kinds
of substructures: (a) streamline slimming, which merges consecutive
non-tensor and tensor layers vertically; (b) branch slimming, which merges
non-tensor and tensor branches horizontally. The proposed optimization
operations significantly accelerate model execution and also greatly reduce
run-time memory cost, since the slimmed model architecture contains fewer
hidden layers. To minimize accuracy loss, the parameters in the newly
generated layers are learned with layer-wise fine-tuning based on both
theoretical analysis and empirical verification. As observed in the experiments,
DeepRebirth achieves more than 3x speed-up and 2.5x run-time memory saving on
GoogLeNet with only 0.4% drop of top-5 accuracy on ImageNet. Furthermore, by
combining with other model compression techniques, DeepRebirth offers an
average of 65ms inference time on the CPU of Samsung Galaxy S6 with 86.5% top-5
accuracy, 14% faster than SqueezeNet, which only has a top-5 accuracy of 80.5%.
Comment: AAAI 2018.
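A familiar, closed-form instance of "vertically merging" a non-tensor layer into the preceding tensor layer is folding batch normalization into a convolution. DeepRebirth instead learns the merged-layer parameters with layer-wise fine-tuning; the sketch below only conveys the structural idea under that simplification.

```python
# A minimal sketch of vertical merging, using the closed-form conv + batch-norm
# fold as an illustration. DeepRebirth learns merged-layer parameters by
# fine-tuning; this simpler folding example only shows the structural idea.
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d whose output equals bn(conv(x)) in eval mode."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)     # per-channel
        fused.weight.copy_(conv.weight * scale.view(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))  # True
```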
Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure
Redundancy is widely recognized in Convolutional Neural Networks (CNNs),
which makes it possible to remove unimportant filters from convolutional layers
so as to slim the network with an acceptable performance drop. Inspired by the linear and
combinational properties of convolution, we seek to make some filters
increasingly close and eventually identical for network slimming. To this end,
we propose Centripetal SGD (C-SGD), a novel optimization method, which can
train several filters to collapse into a single point in the parameter
hyperspace. When training is completed, removing the identical
filters trims the network with NO performance loss, so no fine-tuning is
needed. By doing so, we have partly solved an open problem of constrained
filter pruning on CNNs with complicated structure, where some layers must be
pruned following others. Our experimental results on CIFAR-10 and ImageNet have
justified the effectiveness of C-SGD-based filter pruning. Moreover, we have
provided empirical evidence for the assumption that redundancy in deep
neural networks helps the convergence of training, by showing that a redundant
CNN trained using C-SGD outperforms a normally trained counterpart of
equivalent width.
Comment: CVPR 2019.
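A simplified sketch of the centripetal idea follows; it is not the authors' exact update rule. Filters assigned to the same cluster share their averaged gradient and are additionally pulled toward the cluster centroid each step, so they gradually collapse to a single point. The cluster assignments, learning rate, centripetal strength, and stand-in loss are all illustrative assumptions.

```python
# A simplified sketch of centripetal updates (not the authors' exact rule):
# filters in one cluster share an averaged gradient and drift toward their centroid.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 4, 3, padding=1)
clusters = [[0, 1], [2], [3]]            # filters 0 and 1 are meant to become identical
lr, centripetal = 0.05, 0.05

x = torch.randn(8, 3, 16, 16)
for _ in range(300):
    conv.zero_grad()
    conv(x).pow(2).mean().backward()     # stand-in for a task loss
    with torch.no_grad():
        for cluster in clusters:
            grad_mean = conv.weight.grad[cluster].mean(dim=0, keepdim=True)
            w_mean = conv.weight[cluster].mean(dim=0, keepdim=True)
            # shared (averaged) gradient step + pull toward the cluster centroid
            conv.weight[cluster] -= lr * grad_mean
            conv.weight[cluster] -= centripetal * (conv.weight[cluster] - w_mean)

# after training, filters in a cluster are (nearly) identical, so duplicates can be
# removed without changing the function the layer computes
print(torch.allclose(conv.weight[0], conv.weight[1], atol=1e-5))
```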
Creating Lightweight Object Detectors with Model Compression for Deployment on Edge Devices
To achieve lightweight object detectors for deployment on edge devices,
an effective model compression pipeline is proposed in this paper. The
compression pipeline consists of automatic channel pruning for the backbone,
fixed channel deletion for the branch layers, and knowledge distillation for
guidance learning. As a result, ResNet50-v1d is auto-pruned and fine-tuned
on ImageNet to attain a compact base model as the backbone of the object
detector. Then, lightweight object detectors are implemented with the proposed
compression pipeline. For instance, an SSD-300 with a model size of 16.3 MB,
2.31 GFLOPs, and 71.2 mAP is created, a better result than SSD-300-MobileNet.
Comment: lightweight detector, automatic channel pruning, fixed channel
deletion, knowledge distillation.
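The distillation term used for guidance learning typically has the student match the teacher's softened logits alongside the hard labels. The sketch below shows that standard loss only; the temperature, weighting, and toy models are illustrative assumptions, not the paper's detector heads.

```python
# A minimal sketch of a knowledge-distillation loss for guidance learning:
# the student matches the teacher's softened logits plus the hard labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

teacher = nn.Linear(128, 10)   # stand-ins for the large and compressed models' heads
student = nn.Linear(128, 10)
x, labels = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = distillation_loss(student(x), teacher(x).detach(), labels)
loss.backward()
print(float(loss))
```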
AutoSlim: Towards One-Shot Architecture Search for Channel Numbers
We study how to set channel numbers in a neural network to achieve better
accuracy under constrained resources (e.g., FLOPs, latency, memory footprint or
model size). A simple and one-shot solution, named AutoSlim, is presented.
Instead of training many network samples and searching with reinforcement
learning, we train a single slimmable network to approximate the network
accuracy of different channel configurations. We then iteratively evaluate the
trained slimmable model and greedily slim the layer with minimal accuracy drop.
By this single pass, we can obtain the optimized channel configurations under
different resource constraints. We present experiments with MobileNet v1,
MobileNet v2, ResNet-50 and RL-searched MNasNet on ImageNet classification. We
show significant improvements over their default channel configurations. We
also achieve better accuracy than recent channel pruning methods and neural
architecture search methods.
Notably, by setting optimized channel numbers, our AutoSlim-MobileNet-v2 at
305M FLOPs achieves 74.2% top-1 accuracy, 2.4% better than default MobileNet-v2
(301M FLOPs), and even 0.2% better than RL-searched MNasNet (317M FLOPs). Our
AutoSlim-ResNet-50 at 570M FLOPs, without depthwise convolutions, achieves 1.3%
better accuracy than MobileNet-v1 (569M FLOPs). Code and models will be
available at: https://github.com/JiahuiYu/slimmable_networks
Comment: tech report.
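The greedy loop described above can be sketched as follows. This is not the authors' code: the evaluate() and flops() functions are illustrative stand-ins for running the trained slimmable network on held-out data and computing the real cost of a channel configuration, and the step size, minimum width, and budget are assumptions.

```python
# A minimal sketch of a greedy channel-number search: at each step, slim the layer
# whose reduced width hurts (stand-in) accuracy the least, until the budget is met.
import random

layers = {"conv1": 64, "conv2": 128, "conv3": 256}   # current channel numbers
step, min_channels, budget = 8, 16, 0.5 * sum(layers.values())

def evaluate(widths):      # stand-in: would run the slimmable net on held-out data
    return sum(widths.values()) / 448 - random.uniform(0, 0.01)

def flops(widths):         # stand-in: would compute the real FLOPs of the config
    return sum(widths.values())

while flops(layers) > budget:
    candidates = []
    for name in layers:
        if layers[name] - step >= min_channels:
            trial = dict(layers, **{name: layers[name] - step})
            candidates.append((evaluate(trial), trial))
    if not candidates:
        break
    _, layers = max(candidates, key=lambda c: c[0])  # keep the most accurate config

print(layers, flops(layers))
```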
C2S2: Cost-aware Channel Sparse Selection for Progressive Network Pruning
This paper describes a channel-selection approach for simplifying deep neural
networks. Specifically, we propose a new type of generic network layer, called
pruning layer, to seamlessly augment a given pre-trained model for compression.
Each pruning layer, comprising depth-wise kernels, is represented
with a dual format: one is real-valued and the other is binary. The former
enables a two-phase optimization process of network pruning to operate with an
end-to-end differentiable network, and the latter yields the mask information
for channel selection. Our method progressively performs the pruning task
layer-wise, and achieves channel selection according to a sparsity criterion
that favors pruning more channels. We also develop a cost-aware mechanism to prevent
the compression from sacrificing the expected network performance. Our results
for compressing several benchmark deep networks on image classification and
semantic segmentation are comparable to those of state-of-the-art methods.
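The dual real/binary representation can be illustrated with a gate layer trained via a straight-through estimator: a real-valued gate per channel is optimized end-to-end, while its binarized copy supplies the channel mask. This sketch uses a simple per-channel scalar gate rather than the paper's depth-wise kernels, and the threshold is an assumption.

```python
# An illustrative sketch of a dual real/binary channel gate (not the authors'
# exact pruning layer): the forward pass uses the binary mask, while gradients
# flow to the real-valued gates via a straight-through estimator.
import torch
import torch.nn as nn

class PruningGate(nn.Module):
    def __init__(self, channels: int, threshold: float = 0.5):
        super().__init__()
        self.gates = nn.Parameter(torch.ones(channels))   # real-valued format
        self.threshold = threshold

    def forward(self, x):                                  # x: (N, C, H, W)
        binary = (self.gates > self.threshold).float()     # binary format (mask)
        # straight-through: forward uses the binary mask, backward sees the real gates
        mask = binary + self.gates - self.gates.detach()
        return x * mask.view(1, -1, 1, 1)

layer = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), PruningGate(16))
y = layer(torch.randn(2, 3, 8, 8))
y.sum().backward()
print(y.shape, layer[1].gates.grad.shape)
```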
Dynamic Routing Networks
The deployment of deep neural networks in real-world applications is mostly
restricted by their high inference costs. Extensive efforts have been made to
improve the accuracy with expert-designed or algorithm-searched architectures.
However, the incremental improvement is typically achieved with increasingly
expensive models that only a small portion of input instances really needs.
Inference with a static architecture that processes all input instances via the
same transformation would thus incur unnecessary computational costs.
Therefore, customizing the model capacity in an instance-aware manner is much
needed for higher inference efficiency. In this paper, we propose Dynamic
Routing Networks (DRNets), which support efficient instance-aware inference by
routing the input instance to only necessary transformation branches selected
from a candidate set of branches for each connection between transformation
nodes. The branch selection is dynamically determined via the corresponding
branch importance weights, which are first generated from lightweight
hypernetworks (RouterNets) and then recalibrated with Gumbel-Softmax before the
selection. Extensive experiments show that DRNets can substantially reduce
parameter size and FLOPs during inference, with prediction performance
comparable to state-of-the-art architectures.
Comment: 10 pages, 3 figures, 3 tables.
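A rough sketch of the routing mechanism follows. It is not the authors' implementation: a lightweight hypernetwork produces per-branch logits from the input, Gumbel-Softmax recalibrates them into a (hard) selection, and only the winning branch's output survives. The branch choices and sizes are illustrative; for clarity the sketch runs all branches, whereas an efficient implementation would execute only the selected one.

```python
# A rough sketch of instance-aware branch routing with a lightweight router and
# Gumbel-Softmax recalibration (illustrative, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedConnection(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 1),
            nn.Identity(),                                    # a cheap skip branch
        ])
        self.router = nn.Sequential(                          # lightweight "RouterNet"
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, len(self.branches)))

    def forward(self, x):
        logits = self.router(x)                               # (N, num_branches)
        weights = F.gumbel_softmax(logits, tau=1.0, hard=True)  # one-hot per instance
        outputs = torch.stack([b(x) for b in self.branches], dim=1)  # (N, B, C, H, W)
        return (outputs * weights.view(*weights.shape, 1, 1, 1)).sum(dim=1)

block = RoutedConnection(8)
print(block(torch.randn(4, 8, 16, 16)).shape)   # torch.Size([4, 8, 16, 16])
```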
Dynamic Channel Pruning: Feature Boosting and Suppression
Making deep convolutional neural networks more accurate typically comes at
the cost of increased computational and memory resources. In this paper, we
reduce this cost by exploiting the fact that the importance of features
computed by convolutional layers is highly input-dependent, and propose feature
boosting and suppression (FBS), a new method to predictively amplify salient
convolutional channels and skip unimportant ones at run-time. FBS introduces
small auxiliary connections to existing convolutional layers. In contrast to
channel pruning methods which permanently remove channels, it preserves the
full network structures and accelerates convolution by dynamically skipping
unimportant input and output channels. FBS-augmented networks are trained with
conventional stochastic gradient descent, making FBS readily applicable to many
state-of-the-art CNNs. We compare FBS to a range of existing channel pruning
and dynamic execution schemes and demonstrate large improvements on ImageNet
classification. Experiments show that FBS can respectively provide 5x and 2x
savings in compute on VGG-16 and ResNet-18, both with less than 0.6%
top-5 accuracy loss.
Comment: 14 pages, 5 figures, 4 tables, published as a conference paper at
ICLR 2019.
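The boosting-and-suppression mechanism can be sketched as follows, in a form simplified from the paper: a small auxiliary layer predicts per-channel saliency from the pooled input, the top-k output channels are kept and scaled by their saliency, and the rest are suppressed to zero for that input. The value of k and the layer sizes are illustrative assumptions.

```python
# A rough, simplified sketch of feature boosting and suppression: predict channel
# saliency from the pooled input, keep only the top-k channels per instance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBSConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, keep: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.saliency = nn.Linear(in_ch, out_ch)   # small auxiliary predictor
        self.keep = keep

    def forward(self, x):                           # x: (N, C_in, H, W)
        pooled = F.adaptive_avg_pool2d(x, 1).flatten(1)        # (N, C_in)
        scores = F.relu(self.saliency(pooled))                 # (N, C_out) saliency
        topk = scores.topk(self.keep, dim=1)
        gate = torch.zeros_like(scores).scatter(1, topk.indices, topk.values)
        # channels outside the top-k are suppressed; in an efficient implementation
        # their convolutions would simply be skipped to save compute
        return self.conv(x) * gate.unsqueeze(-1).unsqueeze(-1)

layer = FBSConv(16, 32, keep=8)
y = layer(torch.randn(2, 16, 14, 14))
print((y.abs().sum(dim=(2, 3)) > 0).sum(dim=1))   # at most 8 active channels per input
```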
Model Slicing for Supporting Complex Analytics with Elastic Inference Cost and Resource Constraints
Deep learning models have been used to support analytics beyond simple
aggregation, where deeper and wider models have been shown to yield great
results. These models consume a huge amount of memory and computational
operations. However, most large-scale industrial applications are constrained
by a computational budget. In practice, the peak workload of an inference
service can be 10x higher than the average case, with the possibility of
unpredictable extreme cases. Substantial computational resources can be wasted
during off-peak hours, and the system may crash when the workload exceeds its
capacity. How to support deep learning services under dynamic workloads
cost-efficiently remains a challenging problem. In this paper, we address the
challenge with a general and novel training scheme called model slicing, which
enables deep learning models to provide predictions within the prescribed
computational resource budget dynamically. Model slicing could be viewed as an
elastic computation solution without requiring more computational resources.
Succinctly, each layer in the model is divided into groups of contiguous blocks
of basic components (i.e., neurons in dense layers and channels in convolutional
layers), and a partial ordering is then introduced over these groups by
enforcing that the groups participating in each forward pass always run from the
first group to a dynamically determined rightmost group. Trained by
dynamically indexing the rightmost group with a single parameter, the slice rate,
the network is encouraged to build up group-wise, residual representations.
During inference, a sub-model with fewer groups can then be readily deployed
for efficiency; its computation is roughly quadratic in the width controlled
by the slice rate. Extensive experiments show that models trained with model
slicing can effectively support on-demand workloads with elastic inference cost.
Comment: 14 pages, 8 figures. arXiv admin note: text overlap with
arXiv:1706.02093 by other authors.
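A minimal sketch of the slice-rate mechanism follows. It is not the authors' code: a layer's units are organized as contiguous groups, and a single slice rate decides how many leading groups participate in a forward pass; the group count and sizes are illustrative, and training would sample different slice rates per step.

```python
# A minimal sketch of slicing a layer by a single slice rate: only the leading
# groups of units participate, so a narrower sub-model can be served on demand.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceableLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, groups: int = 4):
        super().__init__()
        self.full = nn.Linear(in_features, out_features)
        self.groups = groups

    def forward(self, x, slice_rate: float = 1.0):
        # always keep groups 1..r, where r is set by the slice rate
        group_size = self.full.out_features // self.groups
        active = max(1, round(slice_rate * self.groups)) * group_size
        return F.linear(x, self.full.weight[:active], self.full.bias[:active])

layer = SliceableLinear(32, 64, groups=4)
x = torch.randn(8, 32)
print(layer(x, slice_rate=1.0).shape, layer(x, slice_rate=0.5).shape)
# full model: (8, 64); sliced sub-model: (8, 32). When the next layer's input is
# sliced to match, compute shrinks roughly quadratically with the slice rate.
```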