Online Knowledge Distillation with Diverse Peers
Distillation is an effective knowledge-transfer technique that uses predicted
distributions of a powerful teacher model as soft targets to train a
less-parameterized student model. A pre-trained high capacity teacher, however,
is not always available. Recently proposed online variants use the aggregated
intermediate predictions of multiple student models as targets to train each
student model. Although group-derived targets give a good recipe for
teacher-free distillation, group members are homogenized quickly with simple
aggregation functions, leading to early saturated solutions. In this work, we
propose Online Knowledge Distillation with Diverse peers (OKDDip), which
performs two-level distillation during training with multiple auxiliary peers
and one group leader. In the first-level distillation, each auxiliary peer
holds an individual set of aggregation weights generated with an
attention-based mechanism to derive its own targets from predictions of other
auxiliary peers. Learning from distinct target distributions helps to boost peer diversity, which is essential for effective group-based distillation. The second-level
distillation is performed to transfer the knowledge in the ensemble of
auxiliary peers further to the group leader, i.e., the model used for
inference. Experimental results show that the proposed framework consistently outperforms state-of-the-art approaches without increasing training or inference complexity, demonstrating the effectiveness of the proposed two-level distillation framework.
Comment: Accepted to AAAI-202
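As a concrete illustration of the first-level distillation, below is a minimal PyTorch-style sketch of how each auxiliary peer could derive its own soft targets from the other peers' predictions through attention-based aggregation weights. The projection layers, the temperature, and the use of batch-averaged logits as peer embeddings are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn.functional as F

def first_level_targets(logits, query_proj, key_proj, temperature=3.0):
    """Derive an individual soft target for each auxiliary peer by attending
    over the other peers' predictive distributions.

    logits: tensor of shape (num_peers, batch, num_classes).
    query_proj / key_proj: e.g. torch.nn.Linear(num_classes, d) layers
    trained jointly with the peers (a hypothetical choice).
    """
    # Peer embeddings used as attention queries/keys: the mean logit of each
    # peer over the batch (an illustrative simplification).
    feats = logits.mean(dim=1)                          # (num_peers, num_classes)
    scores = query_proj(feats) @ key_proj(feats).t()    # (num_peers, num_peers)
    scores.fill_diagonal_(float("-inf"))                # a peer does not attend to itself
    weights = F.softmax(scores, dim=-1)                 # individual aggregation weights
    soft = F.softmax(logits / temperature, dim=-1)      # peers' predictive distributions
    # Each peer's target is its own weighted combination of the other peers.
    targets = torch.einsum("ij,jbc->ibc", weights, soft)
    return targets.detach()

def first_level_loss(logits, targets, temperature=3.0):
    # KL divergence between each peer's tempered prediction and its own target.
    log_p = F.log_softmax(logits / temperature, dim=-1)
    return F.kl_div(log_p, targets, reduction="batchmean") * temperature ** 2

Because every peer attends with its own weights, the derived targets differ across peers, which is what keeps the group from homogenizing under a single shared aggregate.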
Peer Collaborative Learning for Online Knowledge Distillation
Traditional knowledge distillation uses a two-stage training strategy to
transfer knowledge from a high-capacity teacher model to a compact student
model, which relies heavily on the pre-trained teacher. Recent online knowledge
distillation alleviates this limitation by collaborative learning, mutual
learning and online ensembling, following a one-stage end-to-end training
fashion. However, collaborative learning and mutual learning fail to construct
an online high-capacity teacher, whilst online ensembling ignores the
collaboration among branches and its logit summation impedes the further
optimisation of the ensemble teacher. In this work, we propose a novel Peer
Collaborative Learning method for online knowledge distillation, which
integrates online ensembling and network collaboration into a unified
framework. Specifically, given a target network, we construct a multi-branch
network for training, in which each branch is called a peer. We perform random
augmentation multiple times on the inputs to peers and assemble feature
representations output by the peers with an additional classifier to form the peer
ensemble teacher. This helps to transfer knowledge from a high-capacity teacher
to peers, and in turn further optimises the ensemble teacher. Meanwhile, we
employ the temporal mean model of each peer as the peer mean teacher to
collaboratively transfer knowledge among peers, which helps each peer learn richer knowledge and yields a more stable model with better
generalisation. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet show
that the proposed method significantly improves the generalisation of various
backbone networks and outperforms state-of-the-art methods.
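The temporal mean model mentioned above can be read as an exponential moving average (EMA) of each peer's weights. Below is a minimal sketch of such a peer mean teacher update in PyTorch; the function name and the momentum value are assumptions rather than the paper's exact configuration.

import copy
import torch

@torch.no_grad()
def update_peer_mean_teacher(peer, mean_teacher, momentum=0.999):
    # EMA update: mean_teacher <- momentum * mean_teacher + (1 - momentum) * peer.
    for p_t, p_s in zip(mean_teacher.parameters(), peer.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Usage sketch: create the mean teacher as a frozen copy of the peer, then call
# the update after every optimisation step; its (tempered) predictions serve as
# soft targets for the other peers.
# mean_teacher = copy.deepcopy(peer).requires_grad_(False)
# update_peer_mean_teacher(peer, mean_teacher)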
Improving Neural Topic Models with Wasserstein Knowledge Distillation
Topic modeling is a dominant method for exploring document collections on the
web and in digital libraries. Recent approaches to topic modeling use
pretrained contextualized language models and variational autoencoders.
However, large neural topic models have a considerable memory footprint. In
this paper, we propose a knowledge distillation framework to compress a
contextualized topic model without loss in topic quality. In particular, the
proposed distillation objective minimizes the cross-entropy between the soft labels produced by the teacher and the student models, together with the squared 2-Wasserstein distance between the latent distributions learned by
the two models. Experiments on two publicly available datasets show that the
student trained with knowledge distillation achieves topic coherence much
higher than that of the original student model, and even surpasses the teacher
while containing far fewer parameters than the teacher. The distilled model also outperforms several other competitive topic models on topic coherence.
Comment: Accepted at ECIR 202
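For readers who want the distillation objective in concrete terms, the following is a small sketch assuming both teacher and student are VAE-style neural topic models with diagonal-Gaussian latent posteriors, for which the squared 2-Wasserstein distance has a closed form. Variable names and the weighting coefficient alpha are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def wasserstein_kd_loss(student_logits, teacher_logits,
                        mu_s, logvar_s, mu_t, logvar_t, alpha=1.0):
    # Cross-entropy between the teacher's and the student's soft labels.
    ce = -(F.softmax(teacher_logits, dim=-1) *
           F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    # Squared 2-Wasserstein distance between two diagonal Gaussians:
    # ||mu_s - mu_t||^2 + ||sigma_s - sigma_t||^2.
    sigma_s = torch.exp(0.5 * logvar_s)
    sigma_t = torch.exp(0.5 * logvar_t)
    w2_sq = ((mu_s - mu_t) ** 2).sum(dim=-1) + ((sigma_s - sigma_t) ** 2).sum(dim=-1)
    return ce + alpha * w2_sq.mean()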
Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection
Multi-label image classification is a fundamental but challenging task
towards general visual understanding. Existing methods have found that region-level cues (e.g., features from RoIs) can facilitate multi-label classification.
Nevertheless, such methods usually require laborious object-level annotations
(i.e., object labels and bounding boxes) for effective learning of the
object-level visual features. In this paper, we propose a novel and efficient
deep framework to boost multi-label classification by distilling knowledge from
a weakly-supervised detection task that requires no bounding box annotations.
Specifically, given the image-level annotations, (1) we first develop a
weakly-supervised detection (WSD) model, and then (2) construct an end-to-end
multi-label image classification framework augmented by a knowledge
distillation module in which the WSD model guides the classification model through its class-level predictions for the whole image and its object-level visual features for object RoIs. The WSD model is the teacher
model and the classification model is the student model. After this cross-task
knowledge distillation, the performance of the classification model is
significantly improved and the efficiency is maintained since the WSD model can
be safely discarded in the test phase. Extensive experiments on two large-scale
datasets (MS-COCO and NUS-WIDE) show that our framework outperforms state-of-the-art methods in both accuracy and efficiency.
Comment: accepted by ACM Multimedia 2018, 9 pages, 4 figures, 5 table
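To make the two distillation signals concrete, here is a rough sketch of how the WSD teacher could guide the classification student on both the class-level predictions for the whole image and the object-level RoI features. Shapes, names, the temperature, and the loss weighting are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def cross_task_distillation_loss(student_image_logits, teacher_image_logits,
                                 student_roi_feats, teacher_roi_feats,
                                 temperature=2.0, beta=0.5):
    # (1) Class-level distillation on whole-image predictions. Multi-label
    # classification uses per-class sigmoids, so the soft targets are the
    # tempered sigmoid outputs of the WSD teacher.
    soft_targets = torch.sigmoid(teacher_image_logits / temperature)
    cls_loss = F.binary_cross_entropy_with_logits(
        student_image_logits / temperature, soft_targets)
    # (2) Object-level distillation: match the student's RoI features to the
    # (frozen) teacher's RoI features.
    feat_loss = F.mse_loss(student_roi_feats, teacher_roi_feats.detach())
    return cls_loss + beta * feat_loss

Since only the student is kept at test time, both terms vanish from the inference path, which is why the WSD teacher can be discarded without any efficiency cost.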
FreeKD: Free-direction Knowledge Distillation for Graph Neural Networks
Knowledge distillation (KD) has demonstrated its effectiveness to boost the
performance of graph neural networks (GNNs), where its goal is to distill
knowledge from a deeper teacher GNN into a shallower student GNN. However, it
is actually difficult to train a satisfactory teacher GNN due to the well-known
over-parameterization and over-smoothing issues, leading to ineffective knowledge transfer in practical applications. In this paper, we propose the first Free-direction Knowledge Distillation framework via Reinforcement learning for GNNs, called FreeKD, which no longer requires a deeper, well-optimized teacher GNN. The core idea of our work is to collaboratively
build two shallower GNNs in an effort to exchange knowledge between them via
reinforcement learning in a hierarchical way. Observing that a typical GNN model often performs better at some nodes and worse at others during training, we devise a dynamic, free-direction knowledge transfer strategy that consists of two levels of actions: 1) a node-level action determines the direction of knowledge transfer between the corresponding nodes of the two networks; and 2) a structure-level action determines which of the local structures generated by the node-level actions are propagated. In essence,
our FreeKD is a general and principled framework that is naturally compatible with GNNs of different architectures. Extensive experiments on five benchmark datasets demonstrate that FreeKD outperforms its two base GNNs by a large margin and is effective across various GNNs. More surprisingly, FreeKD achieves comparable or even better performance than traditional KD algorithms that distill knowledge from a deeper and stronger teacher GNN.
Comment: Accepted to KDD 202
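As a very simplified illustration of the node-level, free-direction idea, the sketch below lets the network that currently performs better at a node act as teacher for the other network at that node. The paper learns this choice (and the structure-level choice) with reinforcement-learning agents; the greedy per-node comparison here merely stands in for that policy, and all names are assumptions.

import torch
import torch.nn.functional as F

def node_level_transfer_loss(logits_a, logits_b, labels, temperature=2.0):
    # Per-node cross-entropy as a proxy for which network is better at a node.
    ce_a = F.cross_entropy(logits_a, labels, reduction="none")   # (num_nodes,)
    ce_b = F.cross_entropy(logits_b, labels, reduction="none")
    a_teaches = (ce_a < ce_b).float()                            # 1 where network A teaches B
    p_a = F.softmax(logits_a / temperature, dim=-1)
    p_b = F.softmax(logits_b / temperature, dim=-1)
    # KL divergence in the chosen direction; the teacher side is detached so
    # gradients only flow into the network acting as student at each node.
    kl_a_to_b = (p_a.detach() * (p_a.detach().log() - p_b.log())).sum(-1)
    kl_b_to_a = (p_b.detach() * (p_b.detach().log() - p_a.log())).sum(-1)
    return (a_teaches * kl_a_to_b + (1.0 - a_teaches) * kl_b_to_a).mean()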