Large scale distributed neural network training through online distillation
Techniques such as ensembling and distillation promise model quality
improvements when paired with almost any base model. However, due to increased
test-time cost (for ensembles) and increased complexity of the training
pipeline (for distillation), these techniques are challenging to use in
industrial settings. In this paper we explore a variant of distillation which
is relatively straightforward to use as it does not require a complicated
multi-stage setup or many new hyperparameters. Our first claim is that online
distillation enables us to use extra parallelism to fit very large datasets
about twice as fast. Crucially, we can still speed up training even after we
have already reached the point at which additional parallelism provides no
benefit for synchronous or asynchronous stochastic gradient descent. Two neural
networks trained on disjoint subsets of the data can share knowledge by
encouraging each model to agree with the predictions the other model would have
made. These predictions can come from a stale version of the other model so
they can be safely computed using weights that only rarely get transmitted. Our
second claim is that online distillation is a cost-effective way to make the
exact predictions of a model dramatically more reproducible. We support our
claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet,
and the largest to-date dataset used for neural language modeling, containing
6×10^11 tokens and based on the Common Crawl repository of web data.
Comment: Clarify that implementations should use available parallelism in
pseudo-code
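The core training step is simple enough to sketch. Below is a minimal PyTorch-style illustration of the codistillation idea, assuming two peer models where each step trains one model against both the ground-truth labels and the softened predictions of a stale copy of its peer; the function name and the mixing weight alpha are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def codistillation_step(model_a, model_b_stale, inputs, labels, optimizer,
                        alpha=0.5):
    """One update for model A against labels and its peer's stale predictions."""
    logits_a = model_a(inputs)
    with torch.no_grad():
        # Peer predictions computed from weights that only rarely get transmitted.
        peer_probs = F.softmax(model_b_stale(inputs), dim=-1)
    ce = F.cross_entropy(logits_a, labels)
    # Encourage agreement with the predictions the other model would have made.
    kl = F.kl_div(F.log_softmax(logits_a, dim=-1), peer_probs,
                  reduction="batchmean")
    loss = (1.0 - alpha) * ce + alpha * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the peer's weights are only synchronized occasionally, the extra term costs one additional forward pass and essentially no communication per step.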
Knowledge Distillation by On-the-Fly Native Ensemble
Knowledge distillation is effective to train small and generalisable network
models for meeting the low-memory and fast running requirements. Existing
offline distillation methods rely on a strong pre-trained teacher, which
enables favourable knowledge discovery and transfer but requires a complex
two-phase training procedure. Online counterparts address this limitation at
the price of lacking a high-capacity teacher. In this work, we present an
On-the-fly Native Ensemble (ONE) strategy for one-stage online distillation.
Specifically, ONE trains only a single multi-branch network while
simultaneously establishing a strong teacher on-the-fly to enhance the
learning of the target network. Extensive evaluations show that ONE improves
the generalisation performance of a variety of deep neural networks more
significantly than alternative methods on four image classification datasets:
CIFAR10, CIFAR100, SVHN, and ImageNet, whilst retaining computational
efficiency advantages.
Comment: To appear in NIPS 2018
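A minimal sketch of the ONE idea, assuming PyTorch: branch classifiers share a feature extractor, a learned gate combines their logits into an on-the-fly ensemble teacher, and each branch is distilled from that teacher. The class names, number of branches, and temperature below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ONEHead(nn.Module):
    """Multi-branch head: branch classifiers over a shared feature, plus a
    learned gate that combines branch logits into an ensemble teacher."""
    def __init__(self, feat_dim, num_classes, num_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(num_branches))
        self.gate = nn.Linear(feat_dim, num_branches)

    def forward(self, feat):
        branch_logits = [b(feat) for b in self.branches]           # each (B, C)
        weights = F.softmax(self.gate(feat), dim=-1)               # (B, K)
        stacked = torch.stack(branch_logits, dim=1)                # (B, K, C)
        teacher_logits = (weights.unsqueeze(-1) * stacked).sum(1)  # (B, C)
        return branch_logits, teacher_logits

def one_loss(branch_logits, teacher_logits, labels, T=3.0):
    # Hard-label loss for the ensemble teacher and every branch ...
    loss = F.cross_entropy(teacher_logits, labels)
    for logits in branch_logits:
        loss = loss + F.cross_entropy(logits, labels)
        # ... plus distillation from the ensemble back into each branch.
        loss = loss + T * T * F.kl_div(
            F.log_softmax(logits / T, dim=-1),
            F.softmax(teacher_logits.detach() / T, dim=-1),
            reduction="batchmean")
    return loss
```

Since the teacher is just a gated combination of the branches, the whole system trains in a single stage with no pre-trained teacher.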
Ranking Distillation: Learning Compact Ranking Models With High Performance for Recommender System
We propose a novel way to train ranking models, such as recommender systems,
that are both effective and efficient. Knowledge distillation (KD) was shown to
be successful in image recognition to achieve both effectiveness and
efficiency. We propose a KD technique for learning to rank problems, called
\emph{ranking distillation (RD)}. Specifically, we train a smaller student
model to learn to rank documents/items from both the training data and the
supervision of a larger teacher model. The student model achieves a similar
ranking performance to that of the large teacher model, but its smaller model
size makes the online inference more efficient. RD is flexible because it is
orthogonal to the choices of ranking models for the teacher and student. We
address the challenges of RD for ranking problems. The experiments on public
data sets and state-of-the-art recommendation models showed that RD achieves
its design purposes: the student model learnt with RD has a model size less
than half of the teacher model while achieving a ranking performance similar to
the teacher model and much better than the student model learnt without RD.Comment: Accepted at KDD 201
Feature Fusion for Online Mutual Knowledge Distillation
We propose a learning framework named Feature Fusion Learning (FFL) that
efficiently trains a powerful classifier through a fusion module which combines
the feature maps generated from parallel neural networks. Specifically, we
train a number of parallel neural networks as sub-networks, then we combine the
feature maps from each sub-network using a fusion module to create a more
meaningful feature map. The fused feature map is passed into the fused
classifier for overall classification. Unlike existing feature fusion methods,
in our framework, an ensemble of sub-network classifiers transfers its
knowledge to the fused classifier and then the fused classifier delivers its
knowledge back to each sub-network, mutually teaching one another in an
online-knowledge distillation manner. This mutually teaching system not only
improves the performance of the fused classifier but also obtains performance
gain in each sub-network. Moreover, our model is more beneficial because
different types of network can be used for each sub-network. We have performed
a variety of experiments on multiple datasets such as CIFAR-10, CIFAR-100 and
ImageNet and proved that our method is more effective than other alternative
methods in terms of the performance of both the sub-networks and the fused classifier.
Comment: International Conference on Pattern Recognition
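The two-way teaching loop is easy to sketch. Below is an illustrative PyTorch version, assuming a toy fusion module (channel concatenation plus a 1x1 convolution) and symmetric KL-based distillation between the sub-network ensemble and the fused classifier; module shapes and the temperature are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_kl(p_logits, q_logits, T=3.0):
    # Soften both sides; q acts as the (detached) teacher for this direction.
    return T * T * F.kl_div(F.log_softmax(p_logits / T, dim=-1),
                            F.softmax(q_logits.detach() / T, dim=-1),
                            reduction="batchmean")

class FusionModule(nn.Module):
    """Toy fusion: concatenate sub-network feature maps, mix with a 1x1 conv,
    then classify the fused map."""
    def __init__(self, channels, num_subnets, num_classes):
        super().__init__()
        self.mix = nn.Conv2d(channels * num_subnets, channels, kernel_size=1)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, feature_maps):                  # list of (B, C, H, W)
        fused = F.relu(self.mix(torch.cat(feature_maps, dim=1)))
        pooled = fused.mean(dim=(2, 3))               # global average pooling
        return self.classifier(pooled)

def ffl_loss(subnet_logits, fused_logits, labels):
    # The ensemble of sub-network classifiers teaches the fused classifier ...
    ensemble = torch.stack(subnet_logits, dim=0).mean(0)
    loss = F.cross_entropy(fused_logits, labels) + distill_kl(fused_logits, ensemble)
    # ... and the fused classifier teaches each sub-network back.
    for logits in subnet_logits:
        loss = loss + F.cross_entropy(logits, labels) + distill_kl(logits, fused_logits)
    return loss
```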
Pseudo-Rehearsal: Achieving Deep Reinforcement Learning without Catastrophic Forgetting
Neural networks can achieve excellent results in a wide variety of
applications. However, when they attempt to sequentially learn, they tend to
learn the new task while catastrophically forgetting previous ones. We propose
a model that overcomes catastrophic forgetting in sequential reinforcement
learning by combining ideas from continual learning in both the image
classification domain and the reinforcement learning domain. This model
features a dual memory system which separates continual learning from
reinforcement learning and a pseudo-rehearsal system that "recalls" items
representative of previous tasks via a deep generative network. Our model
sequentially learns Atari 2600 games while continuing to perform above human
level and on par with independent models trained separately on each game.
This result is achieved without demanding additional storage as the number of
tasks increases, without storing raw data, and without revisiting past tasks.
In comparison, previous state-of-the-art solutions are substantially more
vulnerable to forgetting on these complex deep reinforcement learning tasks.
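A minimal sketch of the consolidation idea, assuming PyTorch; `generator.sample` is a hypothetical interface to the deep generative network, and the distillation targets are simplified to mean-squared error on network outputs rather than the paper's exact reinforcement learning objectives.

```python
import copy
import torch.nn.functional as F

def consolidate(long_term_net, short_term_net, generator, new_task_batches,
                optimizer):
    """Illustrative pseudo-rehearsal consolidation. `generator` is a
    hypothetical generative model whose .sample(n) method "recalls"
    pseudo-observations representative of previous tasks; `new_task_batches`
    yields real observations from the task just learned."""
    frozen = copy.deepcopy(long_term_net).eval()  # targets for old tasks
    for obs in new_task_batches:
        # Rehearse previous tasks from generated data, not stored raw data.
        pseudo_obs = generator.sample(obs.shape[0])
        old_loss = F.mse_loss(long_term_net(pseudo_obs),
                              frozen(pseudo_obs).detach())
        # Distil the newly learned behaviour from the short-term network.
        new_loss = F.mse_loss(long_term_net(obs),
                              short_term_net(obs).detach())
        optimizer.zero_grad()
        (old_loss + new_loss).backward()
        optimizer.step()
```

The dual-memory split means the reinforcement learner never has to trade off plasticity against retention; retention is handled entirely by this consolidation pass.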
Correlation Congruence for Knowledge Distillation
Most teacher-student frameworks based on knowledge distillation (KD) depend
on a strong congruence constraint at the instance level. However, they usually
ignore the correlation between multiple instances, which is also valuable for
knowledge transfer. In this work, we propose a new framework named correlation
congruence for knowledge distillation (CCKD), which transfers not only the
instance-level information, but also the correlation between instances.
Furthermore, a generalized kernel method based on Taylor series expansion is
proposed to better capture the correlation between instances. Empirical
experiments and ablation studies on image classification tasks (including
CIFAR-100, ImageNet-1K) and metric learning tasks (including ReID and Face
Recognition) show that the proposed CCKD substantially outperforms the original
KD and achieves state-of-the-art accuracy compared with other SOTA KD-based
methods. CCKD can be easily deployed in the majority of teacher-student
frameworks, such as KD and hint-based learning methods.
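A hedged PyTorch sketch of the correlation-congruence term: build a pairwise kernel matrix over the batch embeddings for teacher and student and penalize their mismatch, alongside the usual instance-level loss. The exact Gaussian kernel is used here for brevity where the paper approximates it with a truncated Taylor expansion; gamma and beta are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def correlation_matrix(feats, gamma=0.5):
    """Pairwise Gaussian-kernel correlations within a batch: (B, B)."""
    feats = F.normalize(feats, dim=-1)
    sq_dists = torch.cdist(feats, feats).pow(2)
    return torch.exp(-gamma * sq_dists)

def cckd_loss(student_feats, teacher_feats, student_logits, labels, beta=1.0):
    # Instance-level supervision ...
    ce = F.cross_entropy(student_logits, labels)
    # ... plus congruence between teacher and student correlation matrices.
    cc = F.mse_loss(correlation_matrix(student_feats),
                    correlation_matrix(teacher_feats).detach())
    return ce + beta * cc
```

Because the congruence term only touches embeddings, it composes with logit-based KD or hint losses without modification.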
Progress & Compress: A scalable framework for continual learning
We introduce a conceptually simple and scalable framework for continual
learning in domains where tasks are learned sequentially. Our method is constant
in the number of parameters and is designed to preserve performance on
previously encountered tasks while accelerating learning progress on subsequent
problems. This is achieved by training a network with two components: A
knowledge base, capable of solving previously encountered problems, which is
connected to an active column that is employed to efficiently learn the current
task. After learning a new task, the active column is distilled into the
knowledge base, taking care to protect any previously acquired skills. This
cycle of active learning (progression) followed by consolidation (compression)
requires no architecture growth, no access to or storing of previous data or
tasks, and no task-specific parameters. We demonstrate the progress & compress
approach on sequential classification of handwritten alphabets as well as two
reinforcement learning domains: Atari games and 3D maze navigation.
Comment: Accepted at ICML 2018
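An illustrative PyTorch sketch of the compress phase, assuming `kb` (knowledge base) and `active` (active column) are networks with compatible outputs and `ewc_penalty` is a callable implementing an online-EWC-style regularizer; the distillation temperature and weighting are placeholders.

```python
import torch
import torch.nn.functional as F

def compress_phase(kb, active, task_loader, optimizer, ewc_penalty,
                   lam=1.0, T=2.0):
    """Distil the active column into the knowledge base while an EWC-style
    penalty on the knowledge base's parameters protects earlier skills."""
    active.eval()
    for x, _ in task_loader:
        with torch.no_grad():
            teacher_probs = F.softmax(active(x) / T, dim=-1)
        distill = T * T * F.kl_div(F.log_softmax(kb(x) / T, dim=-1),
                                   teacher_probs, reduction="batchmean")
        loss = distill + lam * ewc_penalty(kb)  # protect consolidated tasks
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After compression, the active column is free to be reinitialized for the next task, which is what keeps the parameter count constant.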
Recent Advances in Convolutional Neural Network Acceleration
In recent years, convolutional neural networks (CNNs) have shown great
performance in various fields such as image classification, pattern
recognition, and multi-media compression. Two of the feature properties, local
connectivity and weight sharing, can reduce the number of parameters and
increase processing speed during training and inference. However, as the
dimension of data becomes higher and the CNN architecture becomes more
complicated, the end-to-end or combined use of CNNs becomes computationally
intensive, which limits their further deployment. Therefore, it is necessary
and urgent to make CNNs run faster. In this paper, we first summarize the
acceleration methods that
contribute to but not limited to CNN by reviewing a broad variety of research
papers. We propose a taxonomy in terms of three levels, i.e. structure level,
algorithm level, and implementation level, for acceleration methods. We also
analyze the acceleration methods in terms of CNN architecture compression,
algorithm optimization, and hardware-based improvement. At last, we give a
discussion on different perspectives of these acceleration and optimization
methods within each level. The discussion shows that the methods in each level
still have large exploration space. By incorporating such a wide range of
disciplines, we expect to provide a comprehensive reference for researchers who
are interested in CNN acceleration.
Comment: Submitted to Neurocomputing
Accelerating Large Scale Knowledge Distillation via Dynamic Importance Sampling
Knowledge distillation is an effective technique that transfers knowledge
from a large teacher model to a shallow student. However, just as in massive
multi-class classification, large-scale knowledge distillation imposes heavy
computational costs on training models of deep neural networks, as the softmax
activations at the last layer involve computing probabilities over numerous
classes. In this work, we apply the idea of importance sampling, which is often
used in neural machine translation, to large-scale knowledge distillation. We
present a method called dynamic importance sampling, where ranked classes are
sampled from a dynamic distribution derived from the interaction between the
teacher and student in full distillation. We highlight the utility of our
proposal prior, which helps the student capture the main information in the loss
function. Our approach manages to reduce the computational cost at training
time while maintaining the competitive performance on CIFAR-100 and Market-1501
person re-identification datasets.
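A hedged PyTorch sketch of sampled-softmax distillation: normalize only over a subset of classes drawn from a proposal distribution (in the paper, a dynamic one derived from the teacher-student interaction; here passed in as a fixed vector) plus the batch's ground-truth classes. The subset size and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def sampled_distillation_loss(student_logits, teacher_logits, labels,
                              proposal, num_samples=256, T=2.0):
    """proposal: (num_classes,) sampling probabilities over classes."""
    sampled = torch.multinomial(proposal, num_samples, replacement=False)
    # Always keep the batch's ground-truth classes in the subset.
    cols = torch.unique(torch.cat([sampled, labels]))
    s = student_logits[:, cols] / T
    t = teacher_logits[:, cols] / T
    # Softened KD loss computed over the sampled classes only.
    kd = T * T * F.kl_div(F.log_softmax(s, dim=-1),
                          F.softmax(t, dim=-1).detach(),
                          reduction="batchmean")
    # Hard-label term over the same subset (labels remapped to subset indices).
    remap = (cols.unsqueeze(0) == labels.unsqueeze(1)).float().argmax(dim=1)
    ce = F.cross_entropy(s * T, remap)
    return kd + ce
```

The saving comes from the softmax normalization: both losses touch only the sampled columns instead of all classes.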
A Survey of Model Compression and Acceleration for Deep Neural Networks
Deep neural networks (DNNs) have recently achieved great success in many
visual recognition tasks. However, existing deep neural network models are
computationally expensive and memory intensive, hindering their deployment in
devices with low memory resources or in applications with strict latency
requirements. Therefore, a natural thought is to perform model compression and
acceleration in deep networks without significantly decreasing the model
performance. During the past five years, tremendous progress has been made in
this area. In this paper, we review the recent techniques for compacting and
accelerating DNN models. In general, these techniques are divided into four
categories: parameter pruning and quantization, low-rank factorization,
transferred/compact convolutional filters, and knowledge distillation. Methods
of parameter pruning and quantization are described first; the other techniques
are introduced afterwards. For each category, we also provide insightful
analysis about the performance, related applications, advantages, and
drawbacks. Then we go through some very recent successful methods, for example,
dynamic capacity networks and stochastic depth networks. After that, we survey
the evaluation metrics, the main datasets used for evaluating the model
performance, and recent benchmark efforts. Finally, we conclude this paper,
discuss the remaining challenges and possible directions for future work.
Comment: Published in IEEE Signal Processing Magazine; updated version
including more recent works