Network Pruning via Transformable Architecture Search
Network pruning reduces the computation costs of an over-parameterized
network without degrading performance. Prevailing pruning algorithms pre-define
the width and depth of the pruned networks, and then transfer parameters from
the unpruned network to pruned networks. To break the structure limitation of
the pruned networks, we propose to apply neural architecture search to search
directly for a network with flexible channel and layer sizes. The number of
channels/layers is learned by minimizing the loss of the pruned networks. The
feature map of the pruned network is an aggregation of K feature map fragments
(generated by K networks of different sizes), which are sampled based on the
probability distribution. The loss can be back-propagated not only to the
network weights, but also to the parameterized distribution to explicitly tune
the size of the channels/layers. Specifically, we apply channel-wise
interpolation to keep the feature map with different channel sizes aligned in
the aggregation procedure. The maximum probability for the size in each
distribution serves as the width and depth of the pruned network, whose
parameters are learned by knowledge transfer, e.g., knowledge distillation,
from the original networks. Experiments on CIFAR-10, CIFAR-100 and ImageNet
demonstrate the effectiveness of our new perspective of network pruning
compared to traditional network pruning algorithms. Various searching and
knowledge transfer approaches are evaluated to show the effectiveness of the
two components. Code is at: https://github.com/D-X-Y/NAS-Projects.Comment: Published in the 33rd Conference on Neural Information Processing
Systems (NeurIPS 2019
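As a rough illustration of the aggregation step described above, the NumPy sketch below (with made-up shapes and helper names, not the authors' code) interpolates feature-map fragments of different channel counts to a common width and mixes them with softmax weights over learnable size logits:

```python
import numpy as np

def channel_interpolate(feat, target_c):
    """Resize a feature map along the channel axis via linear interpolation,
    so fragments with different channel counts can be aggregated.
    feat: array of shape (C, H, W)."""
    c, h, w = feat.shape
    if c == target_c:
        return feat
    # Positions of source/target channels on a common [0, 1] grid.
    src = np.linspace(0.0, 1.0, c)
    dst = np.linspace(0.0, 1.0, target_c)
    flat = feat.reshape(c, -1)                       # (C, H*W)
    out = np.empty((target_c, h * w))
    for i in range(h * w):
        out[:, i] = np.interp(dst, src, flat[:, i])  # interpolate per pixel
    return out.reshape(target_c, h, w)

def aggregate_fragments(fragments, logits):
    """Aggregate K feature-map fragments, weighted by a softmax over
    learnable size logits (the distribution the loss back-propagates to)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    target_c = max(f.shape[0] for f in fragments)
    return sum(p * channel_interpolate(f, target_c)
               for p, f in zip(probs, fragments))

# K = 3 candidate widths: 8, 12, and 16 channels.
frags = [np.random.rand(c, 4, 4) for c in (8, 12, 16)]
mixed = aggregate_fragments(frags, np.array([0.1, 0.5, 2.0]))
print(mixed.shape)  # (16, 4, 4)
```

In the paper the mixing weights come from the learned size distribution; here they are arbitrary logits chosen only to show the mechanics.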
RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices
Mobile devices are becoming an important carrier for deep learning tasks, as
they are being equipped with powerful, high-end mobile CPUs and GPUs. However,
it is still challenging to execute 3D Convolutional Neural Networks (CNNs)
with real-time performance in addition to high inference accuracy, because
their more complex model structure and higher dimensionality overwhelm the
available computation/storage resources on mobile devices. A natural remedy is
to turn to deep learning weight pruning techniques. However,
the direct generalization of existing 2D CNN weight pruning methods to 3D CNNs
is not ideal for fully exploiting mobile parallelism while achieving high
inference accuracy.
This paper proposes RT3D, a model compression and mobile acceleration
framework for 3D CNNs, seamlessly integrating neural network weight pruning and
compiler code generation techniques. We propose and investigate two
mobile-acceleration-friendly structured sparsity schemes: vanilla structured
sparsity and kernel group structured (KGS) sparsity. The vanilla
sparsity removes whole kernel groups, while KGS sparsity is a more fine-grained
structured sparsity that enjoys higher flexibility while exploiting full
on-device parallelism. We propose a reweighted regularization pruning algorithm
to achieve the proposed sparsity schemes. The inference-time speedup due to
sparsity approaches the pruning rate of the whole model's FLOPs (floating-point
operations). RT3D demonstrates up to 29.1x speedup in end-to-end inference
time compared with current mobile frameworks supporting 3D CNNs,
with a moderate 1%-1.5% accuracy loss. The end-to-end inference time for 16 video
frames could be within 150 ms, when executing representative C3D and R(2+1)D
models on a cellphone. For the first time, real-time execution of 3D CNNs is
achieved on off-the-shelf mobile devices.
Comment: To appear in Proceedings of the 35th AAAI Conference on Artificial
Intelligence (AAAI-21).
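The KGS scheme can be illustrated with a small NumPy sketch: kernels are grouped along the output-channel axis and each group shares one magnitude-based pruning mask, keeping the surviving weights aligned for on-device parallelism. The helper name and grouping details are illustrative assumptions, not the RT3D implementation:

```python
import numpy as np

def kgs_prune(weight, group_size, sparsity):
    """Kernel-group structured (KGS) sparsity sketch: output-channel kernels
    are grouped, and each group shares one pruning mask so the surviving
    weights stay aligned for SIMD execution. Illustrative, not RT3D's code.
    weight: (out_c, in_c, kd, kh, kw) 3D-conv tensor."""
    out_c = weight.shape[0]
    pruned = weight.copy()
    for start in range(0, out_c, group_size):
        group = pruned[start:start + group_size]      # one kernel group (view)
        # Score each weight position by its magnitude summed over the group.
        score = np.abs(group).sum(axis=0)
        k = int(score.size * sparsity)                # positions to drop
        flat = score.reshape(-1)
        drop = np.argsort(flat)[:k]                   # weakest positions
        mask = np.ones_like(flat)
        mask[drop] = 0.0
        group *= mask.reshape(score.shape)            # shared mask per group
    return pruned

w = np.random.randn(8, 4, 3, 3, 3)
pw = kgs_prune(w, group_size=4, sparsity=0.5)
# Each 4-kernel group now shares the same zero pattern.
print(float((pw == 0).mean()))  # 0.5
```

The vanilla scheme of the paper would instead zero out entire kernel groups; KGS keeps the groups but constrains the sparse pattern within them.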
HAQ: Hardware-Aware Automated Quantization with Mixed Precision
Model quantization is a widely used technique to compress and accelerate deep
neural network (DNN) inference. Emergent DNN hardware accelerators begin to
support mixed precision (1-8 bits) to further improve the computation
efficiency, which raises a great challenge to find the optimal bitwidth for
each layer: it requires domain experts to explore the vast design space trading
off among accuracy, latency, energy, and model size, which is both
time-consuming and sub-optimal. Conventional quantization algorithms ignore
differences among hardware architectures and quantize all layers uniformly.
In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ)
framework, which leverages reinforcement learning to automatically determine
the quantization policy and takes the hardware accelerator's feedback into the
design loop. Rather than relying on proxy signals such as FLOPs and model size,
we employ a hardware simulator to generate direct feedback signals (latency and
energy) to the RL agent. Compared with conventional methods, our framework is
fully automated and can specialize the quantization policy for different neural
network architectures and hardware architectures. Our framework effectively
reduces latency by 1.4-1.95x and energy consumption by 1.9x with negligible
accuracy loss compared with fixed-bitwidth (8-bit)
quantization. Our framework reveals that the optimal policies on different
hardware architectures (i.e., edge and cloud architectures) under different
resource constraints (i.e., latency, energy and model size) are drastically
different. We interpret the implications of the different quantization
policies, which offer insights for both neural network architecture design and
hardware architecture design.
Comment: CVPR 2019. The first three authors contributed equally to this work.
Project page: https://hanlab.mit.edu/projects/haq
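A minimal sketch of the per-layer operation the RL agent controls: symmetric linear quantization at an agent-chosen bitwidth. The layer names, the policy dict, and the quantizer itself are illustrative assumptions, not HAQ's actual code:

```python
import numpy as np

np.random.seed(0)  # reproducible toy tensors

def linear_quantize(w, n_bits):
    """Symmetric linear quantization of a tensor to n_bits; the per-layer
    bitwidth is what the RL agent selects. Generic sketch only."""
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                       # dequantized values

# A hypothetical per-layer mixed-precision policy.
layers = {"conv1": np.random.randn(64), "conv2": np.random.randn(64)}
policy = {"conv1": 8, "conv2": 4}          # bits chosen by the agent
quantized = {k: linear_quantize(w, policy[k]) for k, w in layers.items()}
err8 = np.abs(quantized["conv1"] - layers["conv1"]).mean()
err4 = np.abs(quantized["conv2"] - layers["conv2"]).mean()
print(err8 < err4)  # more bits -> lower quantization error
```

HAQ's contribution is choosing these bitwidths per layer from hardware-simulator feedback (latency, energy) rather than proxies; the quantizer above only shows what a chosen bitwidth does to a layer.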
PR-DARTS: Pruning-Based Differentiable Architecture Search
The deployment of Convolutional Neural Networks (CNNs) on edge devices is
hindered by the substantial gap between performance requirements and available
processing power. While recent research has made large strides in developing
network pruning methods for reducing the computing overhead of CNNs, there
remains considerable accuracy loss, especially at high pruning ratios.
Observing that architectures designed for unpruned networks may not be
effective for pruned networks, we propose to search for architectures tailored
to pruning methods by defining a new search space and a novel search objective. To
improve the generalization of the pruned networks, we propose two novel
PrunedConv and PrunedLinear operations. Specifically, these operations mitigate
the problem of unstable gradients by regularizing the objective function of the
pruned networks. The proposed search objective enables us to train the
architecture parameters with respect to the pruned weight elements. Quantitative analyses
demonstrate that our searched architectures outperform those used in the
state-of-the-art pruning networks on CIFAR-10 and ImageNet. In terms of
hardware effectiveness, PR-DARTS improves MobileNet-v2's accuracy from 73.44%
to 81.35% (+7.91%) and runs 3.87x faster.
Comment: 18 pages with 11 figures
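PR-DARTS builds on the differentiable-search relaxation of DARTS, which the toy NumPy sketch below illustrates: each edge outputs a softmax-weighted mixture of candidate operations, so the architecture parameters receive gradients alongside the weights. The 1-D operations here are stand-ins, not the paper's PrunedConv/PrunedLinear:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Candidate operations on a 1-D signal (toy stand-ins for conv-cell ops).
ops = [
    lambda x: x,                                            # identity / skip
    lambda x: np.zeros_like(x),                             # zero (prunes edge)
    lambda x: np.convolve(x, np.ones(3) / 3, mode="same"),  # 3-tap smoothing
]

def mixed_op(x, alphas):
    """DARTS-style continuous relaxation: an edge's output is a
    softmax-weighted sum of all candidate operations, so the architecture
    parameters (alphas) are trainable by gradient descent."""
    probs = softmax(alphas)
    return sum(p * op(x) for p, op in zip(probs, ops))

x = np.arange(5, dtype=float)
# A large alpha on the identity op makes the edge nearly a skip connection.
y = mixed_op(x, np.array([5.0, 0.0, 0.0]))
print(np.allclose(y, x, atol=0.1))  # True
```

After search, each edge keeps only its highest-probability operation; PR-DARTS additionally shapes this search toward operations that remain stable under pruning.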
Learning to Weight Samples for Dynamic Early-exiting Networks
Early exiting is an effective paradigm for improving the inference efficiency
of deep networks. By constructing classifiers with varying resource demands
(the exits), such networks allow easy samples to be output at early exits,
removing the need for executing deeper layers. While existing works mainly
focus on the architectural design of multi-exit networks, the training
strategies for such models are largely left unexplored. Current
state-of-the-art models treat all samples identically during training,
ignoring the early-exiting behavior at test time and thus creating a gap
between training and testing. In this paper, we propose to bridge this gap by
sample weighting. Intuitively, easy samples, which generally exit early in the
network during inference, should contribute more to training early classifiers.
Hard samples, which mostly exit from deeper layers, should in contrast be
emphasized by the late classifiers. We propose to adopt a weight
prediction network to weight the loss of different training samples at each
exit. This weight prediction network and the backbone model are jointly
optimized under a meta-learning framework with a novel optimization objective.
By bringing the adaptive behavior during inference into the training phase, we
show that the proposed weighting mechanism consistently improves the trade-off
between classification accuracy and inference efficiency. Code is available at
https://github.com/LeapLabTHU/L2W-DEN.
Comment: ECCV 2022
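A minimal sketch of the weighted training objective, assuming per-exit sample weights have already been produced by a (here hypothetical) weight-prediction network:

```python
import numpy as np

def weighted_multi_exit_loss(exit_losses, sample_weights):
    """Sample-weighted objective for a multi-exit network:
    exit_losses[k][i] is the loss of sample i at exit k, and
    sample_weights[k][i] is the weight a (hypothetical) weight-prediction
    network assigns that sample at that exit. Easy samples get larger
    weights at early exits, hard samples at later ones."""
    total = 0.0
    for losses, weights in zip(exit_losses, sample_weights):
        w = weights / weights.sum()               # normalize per exit
        total += float((w * losses).sum())
    return total

# Two exits, three samples: sample 0 is easy, sample 2 is hard.
losses = [np.array([0.1, 0.5, 2.0]),              # exit 1 (early)
          np.array([0.05, 0.3, 0.9])]             # exit 2 (late)
weights = [np.array([3.0, 1.0, 0.5]),             # early exit favors easy
           np.array([0.5, 1.0, 3.0])]             # late exit favors hard
print(weighted_multi_exit_loss(losses, weights))
```

In the paper the weights are not hand-set as here but predicted by a network that is jointly meta-optimized with the backbone.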
Automatic Network Adaptation for Ultra-Low Uniform-Precision Quantization
Uniform-precision neural network quantization has gained popularity because it
simplifies the densely packed arithmetic units needed for high computing
capability.
However, it ignores the heterogeneous sensitivity to quantization errors
across layers, resulting in sub-optimal inference accuracy. This
work proposes a novel neural architecture search method, called neural channel
expansion, that adjusts the network structure to alleviate accuracy degradation
from ultra-low uniform-precision quantization. The proposed method selectively
expands channels for the quantization sensitive layers while satisfying
hardware constraints (e.g., FLOPs, PARAMs). Based on in-depth analysis and
experiments, we demonstrate that the proposed method can adapt the channels of
several popular networks to achieve superior 2-bit quantization accuracy on
CIFAR-10 and ImageNet. In particular, we achieve the best-to-date Top-1/Top-5 accuracy
for 2-bit ResNet50 with smaller FLOPs and parameter size.
Comment: Accepted as a full paper by the TinyML Research Symposium 2022
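A greedy toy version of the channel-expansion idea, assuming illustrative per-layer sensitivity scores and per-channel FLOPs costs (the paper learns the expansion within a NAS framework rather than greedily):

```python
import numpy as np

def expand_channels(base_channels, sensitivity, flops_per_channel, budget):
    """Greedy sketch of channel expansion under a FLOPs constraint:
    repeatedly widen the layer most sensitive to quantization error,
    as long as the budget allows. Sensitivities are assumed, not measured."""
    channels = list(base_channels)
    flops = sum(c * f for c, f in zip(channels, flops_per_channel))
    while True:
        # Pick the most sensitive layer whose expansion still fits the budget.
        order = np.argsort(sensitivity)[::-1]
        for i in order:
            if flops + flops_per_channel[i] <= budget:
                channels[i] += 1
                flops += flops_per_channel[i]
                break
        else:
            return channels, flops            # no layer fits: stop

base = [16, 32, 64]
sens = np.array([0.9, 0.2, 0.5])   # layer 0 suffers most from 2-bit error
cost = [10, 5, 2]                  # FLOPs added per extra channel
new_c, used = expand_channels(base, sens, cost, budget=700)
print(new_c, used)  # [41, 32, 65] 700
```

The sketch captures only the constraint logic: sensitive layers gain width while total FLOPs stay within the hardware budget, which is what lets ultra-low-precision networks recover accuracy.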