Improved Feature Distillation via Projector Ensemble
In knowledge distillation, previous feature distillation methods mainly focus
on the design of loss functions and the selection of the distilled layers,
while the effect of the feature projector between the student and the teacher
remains under-explored. In this paper, we first discuss a plausible mechanism
of the projector with empirical evidence and then propose a new feature
distillation method based on a projector ensemble for further performance
improvement. We observe that the student network benefits from a projector even
if the feature dimensions of the student and the teacher are the same. Training
a student backbone without a projector can be viewed as a multi-task
learning process: the student must simultaneously extract discriminative
features for classification and match the teacher's features for
distillation. We hypothesize and empirically verify that, without a
projector, the student network tends to overfit the teacher's feature
distribution despite having a different architecture and weight
initialization. This degrades the quality of the student's deep features,
which are ultimately used for classification. Adding a projector, on the other hand,
disentangles the two learning tasks and helps the student network to focus
better on the main feature extraction task while still being able to utilize
teacher features as guidance through the projector. Motivated by the positive
effect of the projector in feature distillation, we propose an ensemble of
projectors to further improve the quality of student features. Experimental
results on different datasets with a series of teacher-student pairs illustrate
the effectiveness of the proposed method.
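As a concrete illustration of the projector-ensemble idea described above, a minimal PyTorch sketch might look as follows. The linear-plus-batch-norm projector design, the ensemble size, and the MSE matching loss are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectorEnsemble(nn.Module):
    """Maps student features into the teacher's feature space with several
    independently initialized projectors and averages their outputs."""

    def __init__(self, s_dim: int, t_dim: int, num_projectors: int = 3):
        super().__init__()
        self.projectors = nn.ModuleList(
            [nn.Sequential(nn.Linear(s_dim, t_dim), nn.BatchNorm1d(t_dim))
             for _ in range(num_projectors)]
        )

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        # Average the ensemble members' projections of the student feature.
        return torch.stack([p(f_s) for p in self.projectors]).mean(dim=0)

def feature_distillation_loss(f_s, f_t, ensemble):
    """MSE between projected student features and detached teacher features."""
    return F.mse_loss(ensemble(f_s), f_t.detach())
```

Because each projector starts from a different random initialization, the averaged projection is less likely to lock the student onto any single mapping of the teacher's feature space.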
Knowledge Distillation Under Ideal Joint Classifier Assumption
Knowledge distillation is a powerful technique to compress large neural
networks into smaller, more efficient networks. Softmax regression
representation learning is a popular approach that uses a pre-trained teacher
network to guide the learning of a smaller student network. While several
studies have explored the effectiveness of softmax regression representation
learning, the mechanism underlying its knowledge transfer is not well
understood. This paper presents Ideal Joint Classifier Knowledge Distillation
(IJCKD), a unified framework that provides a clear and comprehensive
understanding of the existing knowledge distillation methods and a theoretical
foundation for future research. Using mathematical techniques derived from a
theory of domain adaptation, we provide a detailed analysis of the student
network's error bound as a function of the teacher. Our framework enables
efficient knowledge transfer between teacher and student networks and can be
applied to a wide range of tasks.
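For context, a common formulation of softmax regression representation learning (the loss family this framework analyzes) feeds the student's projected features through the teacher's frozen classifier and matches the resulting predictions. The sketch below follows that formulation; the temperature-scaled KL matching and all names are assumptions for illustration, not the paper's own code.

```python
import torch
import torch.nn.functional as F

def srrl_style_loss(f_s_proj, f_t, teacher_classifier, temperature: float = 4.0):
    """Pass student (projected) and teacher features through the teacher's
    frozen classifier and match the softened output distributions.
    Illustrative sketch of the loss family the paper analyzes."""
    with torch.no_grad():
        t_logits = teacher_classifier(f_t)   # teacher's own predictions
    # Gradients flow to the student through its features; the classifier
    # itself stays frozen (excluded from the optimizer).
    s_logits = teacher_classifier(f_s_proj)

    log_p_s = F.log_softmax(s_logits / temperature, dim=1)
    p_t = F.softmax(t_logits / temperature, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
```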
Understanding the Effects of Projectors in Knowledge Distillation
Conventionally, during the knowledge distillation process (e.g. feature
distillation), an additional projector is often required to perform feature
transformation due to the dimension mismatch between the teacher and the
student networks. Interestingly, we discovered that even if the student and the
teacher have the same feature dimensions, adding a projector still helps to
improve the distillation performance. In addition, projectors improve even
logit distillation when added to the architecture. Inspired by these
surprising findings and the general lack of understanding of the projectors in
the knowledge distillation process in the existing literature, this paper
investigates the implicit role that projectors play, which has so far been
overlooked. Our empirical study shows that a student with a projector (1)
obtains a better trade-off between the training accuracy and the testing
accuracy compared to the student without a projector when it has the same
feature dimensions as the teacher, (2) better preserves its similarity to the
teacher beyond shallow and numeric resemblance, from the view of Centered
Kernel Alignment (CKA), and (3) avoids the over-confidence that the teacher
exhibits at test time. Motivated by the positive effects of projectors, we
propose a projector ensemble-based feature distillation method to further
improve distillation performance. Despite the simplicity of the proposed
strategy, empirical results from the evaluation of classification tasks on
benchmark datasets demonstrate the superior classification performance of our
method on a broad range of teacher-student pairs, and verify, in terms of
CKA and model calibration, that the projector ensemble improves the quality
of the student's features.
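The CKA measure used in these analyses has a standard linear form; the sketch below implements that textbook definition for feature matrices with one row per example, and is not code from the paper itself.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear Centered Kernel Alignment between feature matrices of shape
    (n_examples, dim). Standard definition, used here for illustration."""
    # Center each feature dimension.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)

    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = torch.linalg.matrix_norm(Y.T @ X) ** 2
    return cross / (torch.linalg.matrix_norm(X.T @ X) *
                    torch.linalg.matrix_norm(Y.T @ Y))
```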
PURSUhInT: In Search of Informative Hint Points Based on Layer Clustering for Knowledge Distillation
We propose a novel knowledge distillation methodology for compressing deep
neural networks. One of the most efficient methods for knowledge distillation
is hint distillation, where the student model is injected with information
(hints) from several different layers of the teacher model. Although the
selection of hint points can drastically alter the compression performance,
there is no systematic approach for selecting them, other than brute-force
hyper-parameter search. We propose a clustering-based hint selection
methodology, in which the layers of the teacher model are clustered with
respect to several metrics and the cluster centers are used as hint points.
The proposed approach is validated on the CIFAR-100 dataset, with a
ResNet-110 network as the teacher model. Our results show that the hint
points selected by our algorithm yield superior compression performance
compared to state-of-the-art knowledge distillation algorithms on the same
student models and datasets.
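One plausible reading of the clustering step, sketched with scikit-learn's k-means: cluster the teacher's layers by per-layer metric vectors and take, from each cluster, the layer nearest the cluster center as a hint point. The metric vectors and the choice of k-means are placeholders for illustration; the paper's exact metrics are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_hint_layers(layer_metrics: np.ndarray, n_hints: int) -> list:
    """Cluster teacher layers by their metric vectors and return, for each
    cluster, the index of the layer closest to the cluster center.
    Illustrative sketch; the paper's metrics and setup may differ.

    layer_metrics: array of shape (n_layers, n_metrics), one row per layer.
    """
    km = KMeans(n_clusters=n_hints, n_init=10, random_state=0).fit(layer_metrics)
    hints = []
    for c in range(n_hints):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(
            layer_metrics[members] - km.cluster_centers_[c], axis=1)
        hints.append(int(members[dists.argmin()]))
    return sorted(hints)
```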