On Compressing U-net Using Knowledge Distillation
We study the use of knowledge distillation to compress the U-net
architecture. We show that, while standard distillation is not sufficient to
reliably train a compressed U-net, introducing other regularization methods,
such as batch normalization and class re-weighting, in knowledge distillation
significantly improves the training process. This allows us to compress a U-net
by over 1000x, i.e., to 0.1% of its original number of parameters, at a
negligible decrease in performance.
Comment: 4 pages, 1 figure
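The abstract does not spell out the training objective. As a hedged illustration, the sketch below combines a temperature-scaled soft-target loss with a class-re-weighted hard-label term for a segmentation student; the function name, temperature, and mixing weight are assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      class_weights, T=4.0, alpha=0.5):
    """Soft-target distillation plus re-weighted cross-entropy.

    student_logits, teacher_logits: (N, C, H, W) segmentation logits.
    labels: (N, H, W) integer class map.
    class_weights: (C,) tensor up-weighting rare classes (assumption:
    the paper's class re-weighting enters through the hard-label term).
    """
    # Soft targets from the teacher, softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)

    # Hard-label term with class re-weighting.
    ce = F.cross_entropy(student_logits, labels, weight=class_weights)
    return alpha * kd + (1.0 - alpha) * ce
```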
A Survey of Model Compression and Acceleration for Deep Neural Networks
Deep neural networks (DNNs) have recently achieved great success in many
visual recognition tasks. However, existing deep neural network models are
computationally expensive and memory intensive, hindering their deployment in
devices with low memory resources or in applications with strict latency
requirements. Therefore, a natural thought is to perform model compression and
acceleration in deep networks without significantly decreasing the model
performance. During the past five years, tremendous progress has been made in
this area. In this paper, we review the recent techniques for compacting and
accelerating DNN models. In general, these techniques are divided into four
categories: parameter pruning and quantization, low-rank factorization,
transferred/compact convolutional filters, and knowledge distillation. Methods
of parameter pruning and quantization are described first; the other
techniques are introduced. For each category, we also provide insightful
analysis about the performance, related applications, advantages, and
drawbacks. Then we go through some very recent successful methods, for example,
dynamic capacity networks and stochastic depth networks. After that, we survey
the evaluation metrics, the main datasets used for evaluating model
performance, and recent benchmark efforts. Finally, we conclude this paper and
discuss the remaining challenges and possible directions for future work.
Comment: Published in IEEE Signal Processing Magazine; updated version including more recent work
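As a concrete example of the first category, the sketch below shows one-shot global magnitude pruning in PyTorch; it only illustrates the idea, is not a specific method from the survey, and the sparsity level is an arbitrary assumption.

```python
import torch

def magnitude_prune(model, sparsity=0.9):
    """Zero out the smallest-magnitude weights across all layers.

    sparsity: fraction of weights to remove (0.9 keeps 10%). Real
    pipelines usually prune iteratively and fine-tune between rounds.
    """
    # Pool all weight magnitudes to pick one global threshold.
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)

    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:              # skip biases and norm scales
                p.mul_((p.abs() > threshold).float())
    return model
```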
Accelerating Large Scale Knowledge Distillation via Dynamic Importance Sampling
Knowledge distillation is an effective technique that transfers knowledge
from a large teacher model to a shallow student. However, just like massive
classification, large scale knowledge distillation also imposes heavy
computational costs on training models of deep neural networks, as the softmax
activations at the last layer involve computing probabilities over numerous
classes. In this work, we apply the idea of importance sampling which is often
used in Neural Machine Translation on large scale knowledge distillation. We
present a method called dynamic importance sampling, where ranked classes are
sampled from a dynamic distribution derived from the interaction between the
teacher and student in full distillation. We highlight the utility of our
proposal prior, which helps the student capture the main information in the
loss function. Our approach reduces the computational cost at training time
while maintaining competitive performance on the CIFAR-100 and Market-1501
person re-identification datasets.
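The abstract does not give the dynamic distribution itself. As a rough, hedged sketch of the sampled-class idea, the snippet below computes the distillation loss over a subset of classes drawn from the teacher's own probabilities, used here only as a stand-in proposal; all names and hyper-parameters are assumptions.

```python
import torch
import torch.nn.functional as F

def sampled_distillation_loss(student_logits, teacher_logits,
                              num_samples=512, T=2.0):
    """Distillation loss restricted to a sampled subset of classes."""
    with torch.no_grad():
        proposal = F.softmax(teacher_logits / T, dim=-1)     # (N, C)
        # Draw class indices per example from the proposal distribution.
        idx = torch.multinomial(proposal, num_samples, replacement=False)

    # Keep only the sampled classes and renormalise over them.
    t_sub = torch.gather(teacher_logits, 1, idx) / T
    s_sub = torch.gather(student_logits, 1, idx) / T
    return F.kl_div(F.log_softmax(s_sub, dim=-1),
                    F.softmax(t_sub, dim=-1),
                    reduction="batchmean") * (T * T)
```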
Model Compression with Adversarial Robustness: A Unified Optimization Framework
Deep model compression has been extensively studied, and state-of-the-art
methods can now achieve high compression ratios with minimal accuracy loss.
This paper studies model compression through a different lens: could we
compress models without hurting their robustness to adversarial attacks, in
addition to maintaining accuracy? Previous literature suggested that the goals
of robustness and compactness may sometimes conflict. We propose a novel
Adversarially Trained Model Compression (ATMC) framework. ATMC constructs a
unified constrained optimization formulation, where existing compression means
(pruning, factorization, quantization) are all integrated into the constraints.
An efficient algorithm is then developed. An extensive set of experiments is
presented, demonstrating that ATMC obtains a remarkably more favorable
trade-off among model size, accuracy, and robustness than currently available
alternatives in various settings. The code is publicly available at:
https://github.com/shupenggui/ATMC.
Comment: 14 pages, NeurIPS 2019. The first two authors, Gui and Wang, contributed equally and are listed alphabetically
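The abstract does not reproduce the constrained formulation. As background for the robustness term, here is a hedged sketch of a standard PGD attack of the kind adversarial training minimizes the loss against; the step sizes and iteration count are conventional choices, not ATMC's, and in ATMC this minimization would additionally be subject to pruning, factorization, and quantization constraints on the weights.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Craft L-infinity-bounded adversarial examples with projected
    gradient descent; robust training minimizes the loss on them."""
    # Random start inside the epsilon ball.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Ascend the loss, then project back into the epsilon ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```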
SlimNets: An Exploration of Deep Model Compression and Acceleration
Deep neural networks have achieved increasingly accurate results on a wide
variety of complex tasks. However, much of this improvement is due to the
growing use and availability of computational resources (e.g., use of GPUs,
more layers, more parameters). Most state-of-the-art deep networks, despite
performing well, over-parameterize the functions they approximate and take a
significant amount of time to train. With increased focus on deploying deep
neural networks on resource-constrained devices like smartphones, there has been a push to
evaluate why these models are so resource hungry and how they can be made more
efficient. This work evaluates and compares three distinct methods for deep
model compression and acceleration: weight pruning, low rank factorization, and
knowledge distillation. Comparisons on VGG nets trained on CIFAR-10 show that
each method on its own is effective, but that the true power lies in combining
them. We show that by combining pruning and knowledge distillation we can
create a compressed network 85 times smaller than the original, all while
retaining 96% of the original model's accuracy.
Comment: To be published in the 2018 IEEE High Performance Extreme Computing Conference (HPEC)
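A hedged sketch of how the two winning ingredients could be combined, reusing the magnitude_prune helper sketched under the survey entry above; the prune-once-then-distill schedule and all hyper-parameters are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def prune_then_distill(teacher, student, loader, optimizer,
                       epochs=10, T=4.0, alpha=0.7, sparsity=0.9):
    """Prune the student, then fine-tune it against the teacher."""
    magnitude_prune(student, sparsity)   # one-shot prune (sketched earlier)
    teacher.eval()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                          F.softmax(t_logits / T, dim=-1),
                          reduction="batchmean") * (T * T)
            loss = alpha * kd + (1.0 - alpha) * F.cross_entropy(s_logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Note: a real pipeline would re-apply the pruning mask here
            # so that zeroed weights stay zero during fine-tuning.
    return student
```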
On the Compression of Recurrent Neural Networks with an Application to LVCSR acoustic modeling for Embedded Speech Recognition
We study the problem of compressing recurrent neural networks (RNNs). In
particular, we focus on the compression of RNN acoustic models, which are
motivated by the goal of building compact and accurate speech recognition
systems that can run efficiently on mobile devices. In this work, we
present a technique for general recurrent model compression that jointly
compresses both recurrent and non-recurrent inter-layer weight matrices. We
find that the proposed technique allows us to reduce the size of our Long
Short-Term Memory (LSTM) acoustic model to a third of its original size with
negligible loss in accuracy.
Comment: Accepted in ICASSP 201
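The joint compression scheme is not spelled out in the abstract. Below is a hedged sketch of the basic building block, truncated-SVD factorization of a single weight matrix; the rank and the parameter counts mentioned afterwards are illustrative assumptions.

```python
import torch

def low_rank_factorize(weight, rank):
    """Approximate W (m x n) by A (m x r) @ B (r x n) via truncated SVD,
    saving parameters whenever r * (m + n) < m * n."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]      # (m, r), singular values folded in
    B = Vh[:rank, :]                # (r, n)
    return A, B
```

For a hypothetical LSTM layer with hidden size 1024, the stacked gate matrix is about 4096 x 1024 (~4.2M parameters); a rank of 256 would replace it with roughly 1.3M parameters, in the same spirit as the roughly threefold size reduction reported above.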
Distilling portable Generative Adversarial Networks for Image Translation
Although Generative Adversarial Networks (GANs) have been widely used in
various image-to-image translation tasks, they can hardly be deployed on mobile
devices due to their heavy computation and storage cost. Traditional network
compression methods focus on visual recognition tasks and rarely deal with
generation tasks. Inspired by knowledge distillation, a student generator with
fewer parameters is trained by inheriting the low-level and high-level
information from the original heavy teacher generator. To promote the
capability of the student generator, we include a student discriminator to
measure the distances between real images and images generated by the student
and teacher generators. An adversarial learning process is therefore
established to optimize the student generator and the student discriminator.
Qualitative and quantitative analyses on benchmark datasets demonstrate that
the proposed method can learn portable generative models with strong
performance.
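A hedged sketch of one way the student generator's objective could be assembled from the pieces named above: output matching against the frozen teacher plus an adversarial term from the student discriminator. The loss weights, the L1 choice for the matching term, and the non-saturating GAN loss are assumptions.

```python
import torch
import torch.nn.functional as F

def student_generator_loss(x, G_t, G_s, D_s, lam_pix=10.0, lam_adv=1.0):
    """Distillation-style objective for a portable (student) generator.

    G_t: frozen teacher generator, G_s: small student generator,
    D_s: student discriminator scoring realism of translated images.
    """
    with torch.no_grad():
        y_teacher = G_t(x)              # teacher's translation of input x
    y_student = G_s(x)

    # Transfer: match the teacher's output (a stand-in for the paper's
    # low-level and high-level information inheritance).
    pix = F.l1_loss(y_student, y_teacher)

    # Adversarial term: the student discriminator should judge the
    # student's output as real (non-saturating GAN loss).
    score = D_s(y_student)
    adv = F.binary_cross_entropy_with_logits(score, torch.ones_like(score))

    return lam_pix * pix + lam_adv * adv
```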
Policy Distillation
Policies for complex visual tasks have been successfully learned with deep
reinforcement learning, using an approach called deep Q-networks (DQN), but
relatively large (task-specific) networks and extensive training are needed to
achieve good performance. In this work, we present a novel method called policy
distillation that can be used to extract the policy of a reinforcement learning
agent and train a new network that performs at the expert level while being
dramatically smaller and more efficient. Furthermore, the same method can be
used to consolidate multiple task-specific policies into a single policy. We
demonstrate these claims using the Atari domain and show that the multi-task
distilled agent outperforms the single-task teachers as well as a
jointly-trained DQN agent.
Comment: Submitted to ICLR 201
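A hedged sketch of the kind of loss commonly used for this transfer: the teacher DQN's Q-values are sharpened into an action distribution with a low softmax temperature, and the student is trained to match it via KL divergence; the temperature value here is an assumption.

```python
import torch.nn.functional as F

def policy_distillation_loss(student_logits, teacher_q, tau=0.01):
    """KL divergence between the softened teacher action distribution
    and the student policy (tau < 1 sharpens the teacher's Q-values)."""
    teacher_probs = F.softmax(teacher_q / tau, dim=-1)
    log_student = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")
```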
Data-Free Network Quantization With Adversarial Knowledge Distillation
Network quantization is an essential procedure in deep learning for
development of efficient fixed-point inference models on mobile or edge
platforms. However, as datasets grow larger and privacy regulations become
stricter, data sharing for model compression gets more difficult and
restricted. In this paper, we consider data-free network quantization with
synthetic data. The synthetic data are generated from a generator, while no
original data are used in training the generator or in quantization. To this end, we
propose data-free adversarial knowledge distillation, which minimizes the
maximum distance between the outputs of the teacher and the (quantized) student
for any adversarial samples from a generator. To generate adversarial samples
similar to the original data, we additionally propose matching statistics from
the batch normalization layers for generated data and the original data in the
teacher. Furthermore, we show the gain of producing diverse adversarial samples
by using multiple generators and multiple students. Our experiments show the
state-of-the-art data-free model compression and quantization results for
(wide) residual networks and MobileNet on SVHN, CIFAR-10, CIFAR-100, and
Tiny-ImageNet datasets. The accuracy losses compared to using the original
datasets are shown to be minimal.
Comment: CVPR 2020 Joint Workshop on Efficient Deep Learning in Computer Vision (EDLCV)
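A hedged sketch of the two ingredients described above: a batch-norm statistics-matching regularizer for the generator, and the teacher-student output distance that the student minimizes while the generator maximizes. The hook-based implementation and the use of KL divergence as the distance are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

def bn_statistics_loss(teacher, x_fake):
    """Push generated data to reproduce the running batch-norm statistics
    stored in the (frozen, eval-mode) teacher."""
    feats, hooks = [], []

    def hook(module, inputs, output):
        feats.append((module, inputs[0]))

    for m in teacher.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(hook))
    teacher(x_fake)
    for h in hooks:
        h.remove()

    loss = 0.0
    for bn, inp in feats:
        mu = inp.mean(dim=(0, 2, 3))
        var = inp.var(dim=(0, 2, 3), unbiased=False)
        loss = loss + (F.mse_loss(mu, bn.running_mean)
                       + F.mse_loss(var, bn.running_var))
    return loss

def kd_distance(teacher, student, x_fake, T=1.0):
    """Output distance: minimized by the student, maximized by the generator."""
    p_t = F.softmax(teacher(x_fake) / T, dim=-1)
    log_p_s = F.log_softmax(student(x_fake) / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")
```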