Zero-Shot Knowledge Distillation in Deep Networks
Knowledge distillation deals with the problem of training a smaller model
(Student) from a high capacity source model (Teacher) so as to retain most of
its performance. Existing approaches use either the training data or meta-data
extracted from it in order to train the Student. However, accessing the dataset
on which the Teacher has been trained may not always be feasible if the dataset
is very large or it poses privacy or safety concerns (e.g., bio-metric or
medical data). Hence, in this paper, we propose a novel data-free method to
train the Student from the Teacher. Without even using any meta-data, we
synthesize the Data Impressions from the complex Teacher model and utilize
these as surrogates for the original training data samples to transfer its
learning to the Student via knowledge distillation. We therefore dub our method
"Zero-Shot Knowledge Distillation" and demonstrate that our framework achieves
generalization performance competitive with distillation using the actual
training data samples on multiple benchmark datasets.
Comment: Accepted at ICML 2019; code will be available at
https://github.com/vcl-iisc/ZSK
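The distillation step itself follows the standard temperature-softened objective of Hinton et al.; below is a minimal plain-Python sketch of that loss (the synthesis of Data Impressions is the paper's contribution and is not shown, and the temperature value is illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T yields softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student outputs,
    scaled by T^2 as in the standard distillation formulation."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

With matching logits the loss is zero; the student is trained to drive this quantity down on the synthesized surrogates.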
Generative Low-Shot Network Expansion
Conventional deep learning classifiers are static in the sense that they are
trained on a predefined set of classes and learning to classify a novel class
typically requires re-training. In this work, we address the problem of
Low-Shot network expansion learning. We introduce a learning framework which
enables expanding a pre-trained (base) deep network to classify novel classes
when the number of examples for the novel classes is particularly small. We
present a simple yet powerful hard distillation method where the base network
is augmented with additional weights to classify the novel classes, while
keeping the weights of the base network unchanged. We show that because only a
small number of weights need to be trained, hard distillation excels in
low-shot training scenarios. Furthermore, hard distillation avoids degrading
classification performance on the base classes. Finally, we show that low-shot
network expansion can be performed with a very small memory footprint by using a
compact generative model of the base classes' training data, with only
negligible degradation relative to learning with the full training set.
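The core idea of augmenting a frozen base classifier with trainable novel-class weights can be sketched on a toy linear classifier (a stand-in for the network's last layer; the initialization, learning rate, and SGD loop here are illustrative, not the paper's implementation):

```python
import math
import random

def expand_classifier(base_weights, num_novel, dim, scale=0.01):
    """Augment a frozen base classifier with randomly initialised rows
    for the novel classes; only these rows will be trained."""
    rng = random.Random(0)
    novel = [[rng.uniform(-scale, scale) for _ in range(dim)]
             for _ in range(num_novel)]
    # Return the expanded weight matrix and the index of the first novel row.
    return base_weights + novel, len(base_weights)

def train_novel_rows(weights, first_novel, samples, lr=0.5, epochs=200):
    """Cross-entropy SGD that updates only the novel-class rows,
    leaving the base rows (and hence base-class behaviour) untouched."""
    for _ in range(epochs):
        for x, y in samples:
            logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            probs = [e / total for e in exps]
            for c in range(first_novel, len(weights)):  # frozen base rows skipped
                grad = probs[c] - (1.0 if c == y else 0.0)
                weights[c] = [w - lr * grad * xi for w, xi in zip(weights[c], x)]
    return weights
```

Because the base rows are never written, base-class logits (and accuracy) are exactly preserved while the few novel rows fit the low-shot examples.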
Semantic-Aware Knowledge Preservation for Zero-Shot Sketch-Based Image Retrieval
Sketch-based image retrieval (SBIR) is widely recognized as an important
vision problem with a wide range of real-world applications. Recently,
research interest has arisen in solving this problem under the more realistic
and challenging setting of zero-shot learning. In this paper, we investigate this
problem from the viewpoint of domain adaptation, which we show is critical in
improving feature embedding in the zero-shot scenario. Based on a framework
which starts with a pre-trained model on ImageNet and fine-tunes it on the
training set of an SBIR benchmark, we advocate the importance of preserving
previously acquired knowledge, e.g., the rich discriminative features learned
from ImageNet, to improve the model's transfer ability. For this purpose, we
design an approach named Semantic-Aware Knowledge prEservation (SAKE), which
fine-tunes the pre-trained model in an economical way and leverages semantic
information, e.g., inter-class relationship, to achieve the goal of knowledge
preservation. Zero-shot experiments on two extended SBIR datasets, TU-Berlin
and Sketchy, verify the superior performance of our approach. Extensive
diagnostic experiments validate that the preserved knowledge benefits SBIR in
zero-shot settings, as a large fraction of the performance gain comes from the
more properly structured feature embedding for photo images. Code is available
at: https://github.com/qliu24/SAKE
Comment: To appear in ICCV 201
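A common way to realise such knowledge preservation is to add a distillation term that penalises drift from the frozen pre-trained model's outputs during fine-tuning. The sketch below shows that generic pattern; the weighting `lam` and temperature `t` are illustrative assumptions, not SAKE's exact formulation:

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax (shift-invariant for stability)."""
    m = max(logits)
    e = [math.exp((z - m) / t) for z in logits]
    s = sum(e)
    return [x / s for x in e]

def preservation_loss(new_logits, task_label, old_logits, lam=1.0, t=2.0):
    """Fine-tuning objective that adds a distillation term anchoring the
    updated model to the original (e.g. ImageNet-pretrained) model's outputs."""
    probs = softmax(new_logits)
    task = -math.log(probs[task_label])           # standard cross-entropy
    p_old = softmax(old_logits, t)                # frozen original model
    p_new = softmax(new_logits, t)
    preserve = sum(po * math.log(po / pn)
                   for po, pn in zip(p_old, p_new) if po > 0)
    return task + lam * preserve
```

When the fine-tuned model's outputs match the original model's, only the task term remains; any divergence adds a positive preservation penalty.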
Structure-Level Knowledge Distillation For Multilingual Sequence Labeling
Multilingual sequence labeling is a task of predicting label sequences using
a single unified model for multiple languages. Compared with relying on
multiple monolingual models, using a multilingual model has the benefits of a
smaller model size, easier online serving, and generalizability to
low-resource languages. However, current multilingual models still underperform
individual monolingual models significantly due to model capacity limitations.
In this paper, we propose to reduce the gap between monolingual models and the
unified multilingual model by distilling the structural knowledge of several
monolingual models (teachers) to the unified multilingual model (student). We
propose two novel KD methods based on structure-level information: (1) one that
approximately minimizes the distance between the student's and the teachers'
structure-level probability distributions, and (2) one that aggregates the
structure-level knowledge into local distributions and minimizes the distance
between the two local probability distributions. Our experiments on 4
multilingual tasks with 25
datasets show that our approaches outperform several strong baselines and have
stronger zero-shot generalizability than both the baseline model and teacher
models.
Comment: Accepted to ACL 2020, camera-ready. 14 pages
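The second method's aggregation of structure-level knowledge into local (per-position) distributions can be illustrated on an explicit k-best list of scored label sequences. This is a simplification for illustration: the paper works with structured sequence models, not enumerated lists:

```python
import math
from collections import defaultdict

def sequence_to_local(seq_scores, length, labels):
    """Aggregate a distribution over whole label sequences into
    per-position (local) label distributions."""
    total = sum(seq_scores.values())
    marginals = [defaultdict(float) for _ in range(length)]
    for seq, score in seq_scores.items():
        p = score / total                 # normalised sequence probability
        for i, lab in enumerate(seq):
            marginals[i][lab] += p        # accumulate into local marginal
    return [[m[lab] for lab in labels] for m in marginals]

def local_kd_loss(teacher_marginals, student_marginals):
    """Sum of per-position KL divergences between teacher and student."""
    loss = 0.0
    for t_dist, s_dist in zip(teacher_marginals, student_marginals):
        loss += sum(t * math.log(t / s)
                    for t, s in zip(t_dist, s_dist) if t > 0)
    return loss
```

The student is then trained to match these aggregated local distributions, which is cheaper than matching full structure-level distributions.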
Multi-layer Pruning Framework for Compressing Single Shot MultiBox Detector
We propose a framework for compressing state-of-the-art Single Shot MultiBox
Detector (SSD). The framework addresses compression in the following stages:
Sparsity Induction, Filter Selection, and Filter Pruning. In the Sparsity
Induction stage, the object detector model is sparsified via an improved global
threshold. In the Filter Selection & Pruning stage, we select and remove filters
using sparsity statistics of filter weights in two consecutive convolutional
layers. This results in a model smaller than most existing
compact architectures. We evaluate the performance of our framework on
multiple datasets and compare against multiple methods. Experimental results show
that our method achieves state-of-the-art compression of 6.7X and 4.9X on the
PASCAL VOC dataset with the SSD300 and SSD512 models, respectively. We further show
that the method produces maximum compression of 26X with SSD512 on German
Traffic Sign Detection Benchmark (GTSDB). Additionally, we empirically
show our method's adaptability to the classification-based architecture VGG16 on
the CIFAR and German Traffic Sign Recognition Benchmark (GTSRB) datasets,
achieving compression rates of 125X and 200X with FLOP reductions of 90.50% and
96.6%, respectively, with no loss of accuracy. In addition, our method
does not require any special libraries or hardware support for the resulting
compressed models.
Comment: IEEE Winter Conference on Applications of Computer Vision (WACV), 201
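A simplified version of global-threshold filter selection can be sketched as ranking every filter's weight statistics across all layers against a single threshold. Here the statistic is an L1-norm percentile; the paper's improved global threshold and two-layer sparsity statistics are not reproduced:

```python
def l1_norm(filt):
    """L1 norm of a filter's (flattened) weights."""
    return sum(abs(w) for w in filt)

def global_threshold_prune(layers, sparsity=0.5):
    """Rank all filters across all layers by L1 norm and drop the weakest
    `sparsity` fraction using one global threshold, rather than a
    per-layer budget."""
    norms = sorted(l1_norm(f) for layer in layers for f in layer)
    threshold = norms[int(len(norms) * sparsity)]
    return [[f for f in layer if l1_norm(f) >= threshold] for layer in layers]
```

A single global threshold lets heavily over-parameterised layers lose more filters than already-compact ones, instead of pruning every layer uniformly.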
Visual Relationship Detection with Language prior and Softmax
Visual relationship detection is an intermediate image understanding task
that detects two objects and classifies a predicate that explains the
relationship between two objects in an image. The three components are
linguistically and visually correlated (e.g. "wear" is related to "person" and
"shirt", while "laptop" is related to "table" and "on"); thus, the solution
space is huge because there are many possible cases between them. Language and
visual modules are exploited, and a sophisticated spatial vector is proposed.
The models in this work outperformed the state of the art without costly
linguistic knowledge distillation from a large text corpus or complex
loss functions. All experiments were evaluated only on the Visual Relationship
Detection and Visual Genome datasets.
Comment: 6 pages, 4 figures
Creating Lightweight Object Detectors with Model Compression for Deployment on Edge Devices
To achieve lightweight object detectors for deployment on edge devices,
an effective model compression pipeline is proposed in this paper. The
compression pipeline consists of automatic channel pruning for the backbone,
fixed channel deletion for the branch layers, and knowledge distillation for
guidance learning. As a result, ResNet50-v1d is auto-pruned and fine-tuned
on ImageNet to attain a compact base model as the backbone of the object
detector. Then, lightweight object detectors are implemented with the proposed
compression pipeline. For instance, an SSD-300 with model size=16.3MB,
FLOPS=2.31G, and mAP=71.2 is created, a better result than SSD-300-MobileNet.
Comment: lightweight detector, automatic channel pruning, fixed channel
deletion, knowledge distillation
Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning
As a step towards developing zero-shot task generalization capabilities in
reinforcement learning (RL), we introduce a new RL problem where the agent
should learn to execute sequences of instructions after learning useful skills
that solve subtasks. In this problem, we consider two types of generalizations:
to previously unseen instructions and to longer sequences of instructions. For
generalization over unseen instructions, we propose a new objective which
encourages learning correspondences between similar subtasks by making
analogies. For generalization over sequential instructions, we present a
hierarchical architecture where a meta controller learns to use the acquired
skills for executing the instructions. To deal with delayed reward, we propose
a new neural architecture in the meta controller that learns when to update the
subtask, which makes learning more efficient. Experimental results on a
stochastic 3D domain show that the proposed ideas are crucial for
generalization to longer instructions as well as unseen instructions.
Comment: ICML 201
Learning to Learn: Meta-Critic Networks for Sample Efficient Learning
We propose a novel and flexible approach to meta-learning for
learning-to-learn from only a few examples. Our framework is motivated by
actor-critic reinforcement learning, but can be applied to both reinforcement
and supervised learning. The key idea is to learn a meta-critic: an
action-value function neural network that learns to criticise any actor trying
to solve any specified task. For supervised learning, this corresponds to the
novel idea of a trainable task-parametrised loss generator. This meta-critic
approach provides a route to knowledge transfer that can flexibly deal with
few-shot and semi-supervised conditions for both reinforcement and supervised
learning. Promising results are shown on both reinforcement and supervised
learning problems.
Comment: Technical report, 12 pages, 3 figures, 2 tables
Learning Metrics from Teachers: Compact Networks for Image Embedding
Metric learning networks are used to compute image embeddings, which are
widely used in many applications such as image retrieval and face recognition.
In this paper, we propose to use network distillation to efficiently compute
image embeddings with small networks. Network distillation has been
successfully applied to improve image classification, but has hardly been
explored for metric learning. To this end, we propose two new loss functions that
model the communication of a deep teacher network to a small student network.
We evaluate our system on several datasets, including CUB-200-2011, Cars-196,
and Stanford Online Products, and show that embeddings computed using small student
networks perform significantly better than those computed using standard
networks of similar size. Results on a very compact network (MobileNet-0.25),
which can be used on mobile devices, show that the proposed method can greatly
improve Recall@1 from 27.5% to 44.6%. Furthermore, we investigate
various aspects of distillation for embeddings, including hint and attention
layers, semi-supervised learning, and cross-quality distillation. (Code is
available at https://github.com/yulu0724/EmbeddingDistillation.)
Comment: To appear at CVPR 201
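One plausible form of a teacher-to-student embedding loss (not necessarily either of the paper's two losses) matches the student's pairwise distance structure to the teacher's, which sidesteps any mismatch in embedding dimensionality:

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance for every unordered pair of embeddings."""
    n = len(embeddings)
    return [math.dist(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]

def relative_distance_loss(teacher_emb, student_emb):
    """Mean squared deviation between the teacher's and the student's
    pairwise distance structures over a batch."""
    t = pairwise_distances(teacher_emb)
    s = pairwise_distances(student_emb)
    return sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
```

Because only relative distances are compared, the student can live in a much lower-dimensional space than the teacher while still preserving the retrieval-relevant geometry.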