Automated Pruning for Deep Neural Network Compression
In this work we present a method to improve the pruning step of the current
state-of-the-art methodology to compress neural networks. The novelty of the
proposed pruning technique is in its differentiability, which allows pruning to
be performed during the backpropagation phase of the network training. This
enables end-to-end learning and strongly reduces the training time. The
technique is based on a family of differentiable pruning functions and a new
regularizer specifically designed to enforce pruning. The experimental results
show that jointly optimizing the thresholds and the network weights makes it
possible to reach a higher compression rate, reducing the number of weights of
the pruned network by a further 14% to 33% compared to the current
state-of-the-art. Furthermore, we believe that this is the first study where
the generalization capabilities in transfer learning tasks of the features
extracted by a pruned network are analyzed. To achieve this goal, we show that
the representations learned using the proposed pruning methodology maintain the
same effectiveness and generality of those learned by the corresponding
non-compressed network on a set of different recognition tasks.
Comment: 8 pages, 5 figures. Published as a conference paper at ICPR 201
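The abstract does not give the exact form of the pruning functions or the regularizer; purely as an illustration of the general idea (a differentiable gate with a learnable threshold, trained jointly with the weights), here is a minimal PyTorch-style sketch in which the sigmoid gate, the temperature, and the regularizer form are all assumptions rather than the paper's definitions.

```python
import torch
import torch.nn as nn

class SoftThresholdPruning(nn.Module):
    """Differentiable pruning gate: weights whose magnitude falls below a
    learnable threshold are smoothly driven toward zero, so the threshold can
    be trained by backpropagation together with the weights.
    (Illustrative form only; the paper's pruning functions may differ.)"""
    def __init__(self, init_threshold=0.05, temperature=100.0):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature  # sharpness of the soft gate

    def forward(self, weight):
        # gate ~ 1 when |w| >> threshold, ~ 0 when |w| << threshold
        gate = torch.sigmoid(self.temperature * (weight.abs() - self.threshold))
        return weight * gate

def pruning_regularizer(prune_modules, strength=1e-3):
    """Example regularizer that rewards larger thresholds, i.e. more pruning."""
    return -strength * sum(m.threshold for m in prune_modules)
```

In use, each weight tensor would pass through such a gate in the forward pass, and the regularizer would be added to the task loss so that thresholds and weights are optimized end to end.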
Auto Deep Compression by Reinforcement Learning Based Actor-Critic Structure
Model compression is an effective way to deploy neural network models on
devices with limited computation and low power budgets. However, conventional
compression techniques rely on hand-crafted features [2,3,12] and require
experts to explore a large design space trading off model size, speed, and
accuracy, which is usually suboptimal and time-consuming. This paper analyzes
automated deep compression (ADC), which uses reinforcement learning to sample
the design space efficiently and improve the compression quality of the model.
Compression results for state-of-the-art models are obtained in a fully
automated way, without any human effort. With a 4-fold reduction in FLOPs, the
compressed VGG-16 achieves 2.8% higher accuracy on ImageNet than the manually
compressed model.
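No algorithmic details are given in the abstract; the toy sketch below only illustrates the idea of searching per-layer compression ratios against a reward that trades accuracy off against FLOPs. The `prune_layer` and `evaluate` callbacks are hypothetical, and a real actor-critic agent would replace the random proposals with a learned policy.

```python
import random

def search_compression_policy(layers, prune_layer, evaluate,
                              episodes=50, flop_weight=0.01):
    """Toy search over per-layer keep-ratios. `prune_layer(layer, keep)` and
    `evaluate(pruned_layers) -> (accuracy, flops)` are user-supplied,
    hypothetical callbacks; the random proposal stands in for a learned
    actor-critic policy."""
    best_policy, best_reward = None, float("-inf")
    for _ in range(episodes):
        policy = [random.uniform(0.1, 0.9) for _ in layers]
        pruned = [prune_layer(layer, keep) for layer, keep in zip(layers, policy)]
        accuracy, flops = evaluate(pruned)
        reward = accuracy - flop_weight * flops   # accuracy vs. cost trade-off
        if reward > best_reward:
            best_policy, best_reward = policy, reward
    return best_policy
```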
CoCoPIE: Making Mobile AI Sweet As PIE --Compression-Compilation Co-Design Goes a Long Way
Assuming hardware is the major constraint for enabling real-time mobile
intelligence, the industry has mainly dedicated its efforts to developing
specialized hardware accelerators for machine learning and inference. This
article challenges the assumption. By drawing on a recent real-time AI
optimization framework CoCoPIE, it maintains that with effective
compression-compiler co-design, it is possible to enable real-time artificial
intelligence on mainstream end devices without special hardware. CoCoPIE is a
software framework that holds numerous records on mobile AI: the first
framework that supports all main kinds of DNNs, from CNNs to RNNs,
transformers, language models, and so on; the fastest DNN pruning and
acceleration framework, up to 180X faster than current DNN pruning on other
frameworks such as TensorFlow-Lite; making many representative AI applications
able to run in real time on off-the-shelf mobile devices, which had previously
been regarded as possible only with special hardware support; and making
off-the-shelf mobile devices outperform a number of representative ASIC and
FPGA solutions in terms of energy efficiency and/or performance.
A Survey of Model Compression and Acceleration for Deep Neural Networks
Deep neural networks (DNNs) have recently achieved great success in many
visual recognition tasks. However, existing deep neural network models are
computationally expensive and memory intensive, hindering their deployment in
devices with low memory resources or in applications with strict latency
requirements. Therefore, a natural thought is to perform model compression and
acceleration in deep networks without significantly decreasing the model
performance. During the past five years, tremendous progress has been made in
this area. In this paper, we review the recent techniques for compacting and
accelerating DNN models. In general, these techniques are divided into four
categories: parameter pruning and quantization, low-rank factorization,
transferred/compact convolutional filters, and knowledge distillation. Methods
of parameter pruning and quantization are described first, after that the other
techniques are introduced. For each category, we also provide insightful
analysis about the performance, related applications, advantages, and
drawbacks. Then we go through some recent successful methods, for example,
dynamic capacity networks and stochastic depth networks. After that, we survey
the evaluation metrics, the main datasets used for evaluating model
performance, and recent benchmark efforts. Finally, we conclude the paper and
discuss the remaining challenges and possible directions for future work.
Comment: Published in IEEE Signal Processing Magazine, updated version
including more recent work
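As a concrete illustration of one of the four categories, low-rank factorization, here is a short, generic sketch that replaces a dense layer's weight matrix with a truncated-SVD product; it is a textbook construction, not the method of any particular surveyed paper.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate an (out x in) weight matrix W by two factors with
    rank*(out+in) parameters in total, a standard low-rank compression step."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]      # (out x rank), columns scaled by singular values
    B = Vt[:rank, :]                # (rank x in)
    return A, B

W = np.random.randn(512, 1024)
A, B = low_rank_factorize(W, rank=64)
print(W.size, A.size + B.size)      # parameter count before vs. after
```

The two factors can then be implemented as two consecutive linear layers, trading a small approximation error for a much smaller parameter count.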
HSD-CNN: Hierarchically self decomposing CNN architecture using class specific filter sensitivity analysis
Conventional convolutional neural networks (CNNs) are trained on large domain
datasets and are hence typically over-parameterized and inefficient in
limited-class applications. An efficient way to convert such large many-class
pre-trained networks into small few-class networks is through a hierarchical
decomposition of their feature maps. To this end, we propose a four-step
automated framework for such decomposition, the Hierarchically Self Decomposing
CNN (HSD-CNN). HSD-CNN is derived automatically using a
class-specific filter sensitivity analysis that quantifies the impact of
specific features on a class prediction. The decomposed hierarchical network
can be utilized and deployed directly to obtain sub-networks for a subset of
classes, and it is shown to perform better without the requirement of
retraining these sub-networks. Experimental results show that HSD-CNN generally
does not degrade accuracy if the full set of classes is used. Interestingly,
when operating on known subsets of classes, HSD-CNN improves accuracy with a
much smaller model size, requiring far fewer operations.
HSD-CNN flow is verified on the CIFAR10, CIFAR100 and CALTECH101 data sets. We
report accuracies of up to ( ) on scenarios with 13 ( 4 )
classes of CIFAR100, using a VGG-16 network pre-trained on the full data set.
In this case, the proposed HSD-CNN requires fewer parameters and
achieves savings in operations in comparison to the baseline VGG-16
containing features for all 100 classes.
Comment: Accepted in ICVGIP, 201
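The abstract does not spell out how filter sensitivity is computed; one plausible (and here purely hypothetical) reading is an ablation-style score that zeroes one filter at a time and measures the drop in a class's average logit, as sketched below. The function and its arguments are illustrative assumptions.

```python
import torch

@torch.no_grad()
def class_filter_sensitivity(model, conv_layer, loader, class_id, device="cpu"):
    """Hypothetical ablation-style sensitivity: for each filter in `conv_layer`,
    zero it out and measure how much the average logit of `class_id` drops."""
    def mean_class_logit():
        total, n = 0.0, 0
        for x, _ in loader:
            logits = model(x.to(device))
            total += logits[:, class_id].sum().item()
            n += x.size(0)
        return total / n

    baseline = mean_class_logit()
    sensitivities = []
    original = conv_layer.weight.data.clone()
    for f in range(conv_layer.out_channels):
        conv_layer.weight.data[f].zero_()             # ablate filter f
        sensitivities.append(baseline - mean_class_logit())
        conv_layer.weight.data.copy_(original)        # restore the layer
    return sensitivities
```

Filters with low sensitivity for a given subset of classes would be candidates for removal when deriving that subset's sub-network.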
Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration
Previous works utilized the ''smaller-norm-less-important'' criterion to prune
filters with smaller norm values in a convolutional neural network. In this
paper, we analyze this norm-based criterion and point out that its
effectiveness depends on two requirements that are not always met: (1) the norm
deviation of the filters should be large; (2) the minimum norm of the filters
should be small. To solve this problem, we propose a novel filter pruning
method, namely Filter Pruning via Geometric Median (FPGM), to compress the
model regardless of those two requirements. Unlike previous methods, FPGM
compresses CNN models by pruning filters with redundancy, rather than those
with ''relatively less'' importance. When applied to two image classification
benchmarks, our method validates its usefulness and strengths. Notably, on
CIFAR-10, FPGM reduces more than 52% FLOPs on ResNet-110 with even 2.69%
relative accuracy improvement. Moreover, on ILSVRC-2012, FPGM reduces more than
42% FLOPs on ResNet-101 without top-5 accuracy drop, which has advanced the
state-of-the-art. Code is publicly available on GitHub:
https://github.com/he-y/filter-pruning-geometric-median
Comment: Accepted to CVPR 2019 (Oral)
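A minimal sketch of the geometric-median idea: filters that sit close to the common "center" of a layer are the most replaceable by their peers and are pruned first. The code below approximates this by each filter's summed pairwise distance to all others; it illustrates the criterion only and is not the authors' released implementation (see the linked repository for that).

```python
import torch

def fpgm_prune_indices(conv_weight, num_prune):
    """Geometric-median style criterion: filters whose summed distance to the
    other filters is smallest lie near the geometric median and are treated as
    redundant. Returns the indices of filters to prune."""
    filters = conv_weight.view(conv_weight.size(0), -1)  # (n_filters, c_in*k*k)
    dist = torch.cdist(filters, filters)                 # pairwise L2 distances
    redundancy = dist.sum(dim=1)                         # small => near the median
    return torch.argsort(redundancy)[:num_prune].tolist()

w = torch.randn(64, 32, 3, 3)           # example conv weight tensor
print(fpgm_prune_indices(w, num_prune=8))
```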
Domain-adaptive deep network compression
Deep Neural Networks trained on large datasets can be easily transferred to
new domains with far fewer labeled examples by a process called fine-tuning.
This has the advantage that representations learned in the large source domain
can be exploited on smaller target domains. However, networks designed to be
optimal for the source task are often prohibitively large for the target task.
In this work we address the compression of networks after domain transfer.
We focus on compression algorithms based on low-rank matrix decomposition.
Existing methods base compression solely on learned network weights and ignore
the statistics of network activations. We show that domain transfer leads to
large shifts in network activations and that it is desirable to take this into
account when compressing. We demonstrate that considering activation statistics
when compressing weights leads to a rank-constrained regression problem with a
closed-form solution. Because our method takes into account the target domain,
it can remove redundancy in the weights more effectively. Experiments show
that our Domain Adaptive Low Rank (DALR) method significantly outperforms
existing low-rank compression techniques. With our approach, the fc6 layer of
VGG19 can be compressed more than 4x more than using truncated SVD alone --
with only a minor or no loss in accuracy. When applied to domain-transferred
networks it allows for compression down to only 5-20% of the original number of
parameters with only a minor drop in performance.
Comment: Accepted at ICCV 201
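The abstract states that accounting for activation statistics yields a rank-constrained regression with a closed-form solution; the sketch below shows one standard closed form for a problem of that type (projecting the weights onto the top singular subspace of their responses on target-domain data). It illustrates the activation-aware idea and may differ in detail from the DALR derivation.

```python
import numpy as np

def activation_aware_low_rank(W, X, rank):
    """One standard closed form for  min_{rank-k W_hat} ||W X - W_hat X||_F :
    project W onto the top-k left singular subspace of its responses W X,
    where the columns of X are target-domain inputs to the layer."""
    Y = W @ X                                   # layer responses on target data
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    Uk = U[:, :rank]
    A = Uk                                      # (out x rank)
    B = Uk.T @ W                                # (rank x in)
    return A, B                                 # W_hat = A @ B, rank <= k

W = np.random.randn(1024, 1024)                 # e.g. a fully connected layer
X = np.random.randn(1024, 500)                  # activations from the target domain
A, B = activation_aware_low_rank(W, X, rank=128)
```

Because the objective weights the approximation by the actual target-domain activations, directions that matter for the new task are preserved over directions that only mattered in the source domain.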
Incremental Learning Using a Grow-and-Prune Paradigm with Efficient Neural Networks
Deep neural networks (DNNs) have become a widely deployed model for numerous
machine learning applications. However, their fixed architecture, substantial
training cost, and significant model redundancy make it difficult to
efficiently update them to accommodate previously unseen data. To solve these
problems, we propose an incremental learning framework based on a
grow-and-prune neural network synthesis paradigm. When new data arrive, the
neural network first grows new connections based on the gradients to increase
the network capacity to accommodate new data. Then, the framework iteratively
prunes away connections based on the magnitude of weights to enhance network
compactness, and hence recover efficiency. Finally, the model rests at a
lightweight DNN that is both ready for inference and suitable for future
grow-and-prune updates. The proposed framework improves accuracy, shrinks
network size, and significantly reduces the additional training cost for
incoming data compared to conventional approaches, such as training from
scratch and network fine-tuning. For the LeNet-300-100 and LeNet-5 neural
network architectures derived for the MNIST dataset, the framework reduces
training cost by up to 64% (63%) and 67% (63%) compared to training from
scratch (network fine-tuning), respectively. For the ResNet-18 architecture
derived for the ImageNet dataset and DeepSpeech2 for the AN4 dataset, the
corresponding training cost reductions against training from scratch (network
fine-tuning) are 64% (60%) and 67% (62%), respectively. Our derived models
contain fewer network parameters but achieve higher accuracy relative to
conventional baselines.
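Only the two primitives are named in the abstract (gradient-based growth, magnitude-based pruning); below is a generic sketch of both operating on a binary connection mask, with the growth and pruning fractions chosen purely for illustration.

```python
import torch

def grow_connections(weight, mask, grad, grow_fraction=0.05):
    """Gradient-based growth: re-activate the currently masked-out connections
    whose gradients have the largest magnitude (fraction is illustrative)."""
    candidates = grad.abs() * (mask == 0).float()
    k = int(grow_fraction * weight.numel())
    if k > 0:
        idx = torch.topk(candidates.view(-1), k).indices
        mask.view(-1)[idx] = 1.0
    return mask

def prune_connections(weight, mask, prune_fraction=0.05):
    """Magnitude-based pruning: deactivate the active weights with the
    smallest magnitude (fraction is illustrative)."""
    scores = weight.abs().masked_fill(mask == 0, float("inf"))
    k = int(prune_fraction * weight.numel())
    if k > 0:
        idx = torch.topk(scores.view(-1), k, largest=False).indices
        mask.view(-1)[idx] = 0.0
    return mask

# During training the effective weights are weight * mask; growth and pruning
# alternate as new data arrive, leaving a compact network for inference.
```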
Meta Filter Pruning to Accelerate Deep Convolutional Neural Networks
Existing methods usually utilize pre-defined criteria, such as the p-norm, to
prune unimportant filters. There are two major limitations in these methods.
First, the relations of the filters are largely ignored. The filters usually
work jointly to make an accurate prediction in a collaborative way. Similar
filters will have equivalent effects on the network prediction, and the
redundant filters can be further pruned. Second, the pruning criterion remains
unchanged during training. As the network is updated at each iteration, the
filter distribution also changes continuously, so the pruning criterion should
be switched adaptively. In this paper, we propose Meta Filter Pruning (MFP) to
solve the above problems. First, as a complement to the existing p-norm
criterion, we introduce a new pruning criterion considering the filter relation
via filter distance. Additionally, we build a meta pruning framework for filter
pruning, so that our method could adaptively select the most appropriate
pruning criterion as the filter distribution changes. Experiments validate our
approach on two image classification benchmarks. Notably, on ILSVRC-2012, our
MFP reduces more than 50% FLOPs on ResNet-50 with only 0.44% top-5 accuracy
loss.
Comment: 10 page
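How MFP actually selects a criterion is not described in the abstract; the sketch below only illustrates the idea of switching between a norm criterion and a filter-distance (redundancy) criterion depending on the current filter distribution, using a hypothetical separation test.

```python
import torch

def choose_criterion_and_prune(conv_weight, num_prune):
    """Illustrative 'meta' selection: if filter norms are well separated, the
    classic p-norm criterion is informative; otherwise fall back to a
    filter-distance (redundancy) criterion. The 0.1 separation test is a
    stand-in, not MFP's actual switching rule."""
    filters = conv_weight.view(conv_weight.size(0), -1)
    norms = filters.norm(dim=1)
    if norms.std() / norms.mean() > 0.1:
        scores = norms                                       # small norm -> prune
    else:
        scores = torch.cdist(filters, filters).sum(dim=1)    # small distance -> redundant
    return torch.argsort(scores)[:num_prune].tolist()
```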
Hardware-Guided Symbiotic Training for Compact, Accurate, yet Execution-Efficient LSTM
Many long short-term memory (LSTM) applications need fast yet compact models.
Neural network compression approaches, such as the grow-and-prune paradigm,
have proved to be promising for cutting down network complexity by skipping
insignificant weights. However, current compression strategies are mostly
hardware-agnostic and network complexity reduction does not always translate
into execution efficiency. In this work, we propose a hardware-guided symbiotic
training methodology for compact, accurate, yet execution-efficient inference
models. It is based on our observation that hardware may introduce substantial
non-monotonic behavior, which we call the latency hysteresis effect, when
evaluating network size vs. inference latency. This observation calls into
question the mainstream smaller-dimension-is-better compression strategy, which
often leads to a sub-optimal model architecture. By leveraging the
hardware-impacted hysteresis effect and sparsity, we are able to achieve the
symbiosis of model compactness and accuracy with execution efficiency, thus
reducing LSTM latency while increasing its accuracy. We have evaluated our
algorithms on language modeling and speech recognition applications. Relative
to the traditional stacked LSTM architecture obtained for the Penn Treebank
dataset, we reduce the number of parameters by 18.0x (30.5x) and measured
run-time latency by up to 2.4x (5.2x) on Nvidia GPUs (Intel Xeon CPUs) without
any accuracy degradation. For the DeepSpeech2 architecture obtained for the AN4
dataset, we reduce the number of parameters by 7.0x (19.4x), word error rate
from 12.9% to 9.9% (10.4%), and measured run-time latency by up to 1.7x (2.4x)
on Nvidia GPUs (Intel Xeon CPUs). Thus, our method yields compact, accurate,
yet execution-efficient inference models.
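The latency hysteresis effect means that the smallest model is not necessarily the fastest on real hardware; the toy measurement below, with illustrative sizes and repeat counts, shows the kind of profiling that exposes non-monotonic latency as an LSTM's hidden size changes.

```python
import time
import torch
import torch.nn as nn

def measure_latency(hidden_size, input_size=128, seq_len=50, repeats=20):
    """Average forward latency of a single LSTM layer at a given hidden size."""
    lstm = nn.LSTM(input_size, hidden_size)
    x = torch.randn(seq_len, 1, input_size)
    with torch.no_grad():
        lstm(x)                                   # warm-up run
        start = time.perf_counter()
        for _ in range(repeats):
            lstm(x)
    return (time.perf_counter() - start) / repeats

# Latency need not shrink monotonically with hidden size (the "hysteresis"):
for h in range(64, 513, 64):
    print(h, f"{1000 * measure_latency(h):.2f} ms")
```

A hardware-guided training procedure would use such measurements to choose dimensions at latency sweet spots rather than simply minimizing model size.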