Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders
Knowledge distillation (KD) has been a ubiquitous method for model
compression, strengthening the capability of a lightweight model with
knowledge transferred from the teacher. In particular, KD has been employed in
quantization-aware training (QAT) of Transformer encoders like BERT to improve
the accuracy of the student model with the reduced-precision weight parameters.
However, little is understood about which of the various KD approaches best
fits the QAT of Transformers. In this work, we provide an in-depth analysis of
the mechanism of KD on attention recovery of quantized large Transformers. In
particular, we reveal that the previously adopted MSE loss on the attention
score is insufficient for recovering the self-attention information. Therefore,
we propose two KD methods: an attention-map loss and an attention-output loss.
Furthermore, we explore the unification of both losses to address the
task-dependent preference between the attention-map and attention-output losses. The
experimental results on various Transformer encoder models demonstrate that the
proposed KD methods achieve state-of-the-art accuracy for QAT with sub-2-bit
weight quantization. Comment: EMNLP 2022 Main Track Long Paper
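The abstract does not spell out the exact formulation of the two losses, but a minimal PyTorch-style sketch of what attention-map and attention-output distillation could look like is given below. The tensor shapes, the KL/MSE pairing, and the mixing weight alpha are illustrative assumptions, not the paper's definitions.

```python
import torch.nn.functional as F

def attention_map_loss(student_probs, teacher_probs, eps=1e-12):
    # Both tensors: (batch, heads, seq_len, seq_len); each row is a softmax
    # attention distribution. KL divergence between teacher and student maps.
    return F.kl_div((student_probs + eps).log(), teacher_probs, reduction="batchmean")

def attention_output_loss(student_ctx, teacher_ctx):
    # Both tensors: (batch, seq_len, hidden); output of the self-attention sub-layer.
    return F.mse_loss(student_ctx, teacher_ctx)

def attention_kd_loss(s_probs, t_probs, s_ctx, t_ctx, alpha=0.5):
    # alpha is an illustrative mixing weight for unifying the two losses.
    return alpha * attention_map_loss(s_probs, t_probs) + (1 - alpha) * attention_output_loss(s_ctx, t_ctx)
```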
Heterogeneous Generative Knowledge Distillation with Masked Image Modeling
Small CNN-based models usually require transferring knowledge from a large
model before they are deployed in computationally resource-limited edge
devices. Masked image modeling (MIM) methods achieve great success in various
visual tasks but remain largely unexplored in knowledge distillation for
heterogeneous deep models. This is mainly due to the significant
discrepancy between the Transformer-based large model and the CNN-based small
network. In this paper, we develop the first Heterogeneous Generative Knowledge
Distillation (H-GKD) method based on MIM, which can efficiently transfer knowledge
from large Transformer models to small CNN-based models in a generative
self-supervised fashion. Our method builds a bridge between Transformer-based
models and CNNs by training a UNet-style student with sparse convolution, which
can effectively mimic the visual representation inferred by the teacher under
masked image modeling. Our method is a simple yet effective learning paradigm to
learn the visual representation and distribution of data from heterogeneous
teacher models, which can be pre-trained using advanced generative methods.
Extensive experiments show that it adapts well to various models and sizes,
consistently achieving state-of-the-art performance in image classification,
object detection, and semantic segmentation tasks. For example, on the ImageNet-1K
dataset, H-GKD improves the accuracy of ResNet-50 (sparse) from 76.98% to
80.01%.
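As a rough sketch of the core idea, the student (a UNet-style sparse-convolution CNN in the paper) can be trained to reproduce the frozen teacher's patch-level features on masked regions. The shapes and the plain L2 objective below are assumptions for illustration; the paper's exact architecture and loss are not reproduced here.

```python
def masked_feature_mimic_loss(student_feat, teacher_feat, mask):
    # student_feat, teacher_feat: (batch, num_patches, dim) patch-level features.
    # mask: (batch, num_patches) boolean, True where the patch was masked out.
    # The teacher is frozen, so its features are detached from the graph.
    mask = mask.unsqueeze(-1).float()
    diff = (student_feat - teacher_feat.detach()) ** 2
    return (diff * mask).sum() / mask.sum().clamp(min=1.0)
```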
Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models
Predicting a scene graph that captures visual entities and their interactions
in an image has been considered a crucial step towards full scene
comprehension. Recent scene graph generation (SGG) models have shown their
capability of capturing the most frequent relations among visual entities.
However, state-of-the-art results are still far from satisfactory; e.g.,
models can reach 31% overall recall (R@100), whereas the equally important
mean class-wise recall (mR@100) is only around 8% on Visual Genome (VG). The
discrepancy between the R and mR results urges us to shift the focus from pursuing a
high R to pursuing a high mR while keeping R competitive. We suspect that the observed
discrepancy stems from both the annotation bias and the sparse annotations in VG,
in which many visual entity pairs are either not annotated at all or annotated with
only a single relation when several could be valid. To address this particular
issue, we propose a novel SGG training scheme that capitalizes on self-learned
knowledge. It involves two relation classifiers, one offering a less biased
setting for the other to build on. The proposed scheme can be applied to most
existing SGG models and is straightforward to implement. We observe
significant relative improvements in mR (between +6.6% and +20.4%) and
competitive or better R (between -2.4% and 0.3%) across all standard SGG tasks. Comment: accepted to BMVC202
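The abstract leaves the exact interplay of the two classifiers unspecified. One possible reading, sketched below purely as an illustration (the function names, the CE/KL split, and the temperature are assumptions, not the paper's scheme), is to let the less biased classifier supply soft targets for unannotated entity pairs.

```python
import torch.nn.functional as F

def self_learned_relation_loss(main_logits, debiased_logits, labels, annotated, temperature=1.0):
    # main_logits, debiased_logits: (num_pairs, num_relations)
    # labels: (num_pairs,) ground-truth relation indices, valid only where annotated
    # annotated: (num_pairs,) boolean mask of pairs carrying an annotation
    ce = F.cross_entropy(main_logits[annotated], labels[annotated])
    # Unannotated pairs learn from soft predictions of the less biased classifier.
    soft = F.softmax(debiased_logits[~annotated].detach() / temperature, dim=-1)
    kd = F.kl_div(F.log_softmax(main_logits[~annotated], dim=-1), soft, reduction="batchmean")
    return ce + kd
```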
Towards Better Query Classification with Multi-Expert Knowledge Condensation in JD Ads Search
Search query classification, as an effective way to understand user intents,
is of great importance in real-world online ads systems. To ensure low
latency, a shallow model (e.g., FastText) is widely used for efficient online
inference. However, the representation ability of the FastText model is
insufficient, resulting in poor classification performance, especially on some
low-frequency queries and tailed categories. Using a deeper and more complex
model (e.g., BERT) is an effective solution, but it incurs higher online
inference latency and more expensive computing costs. Thus, balancing
inference efficiency and classification performance is of great
practical importance. To overcome this challenge, in this paper, we propose
knowledge condensation (KC), a simple yet effective knowledge distillation
framework to boost the classification performance of the online FastText model
under strict low latency constraints. Specifically, we propose to train an
offline BERT model to retrieve more potentially relevant data. Benefiting from
its powerful semantic representation, relevant labels that are not exposed in the
historical data are added to the training set to better train the FastText
model. Moreover, a novel distribution-diverse multi-expert learning strategy
is proposed to further improve the ability to mine relevant data. By training
multiple BERT models on different data distributions, the experts respectively
perform better on high-, middle-, and low-frequency search queries, and ensembling
these multi-distribution models makes the retrieval ability more powerful. We
have deployed two versions of this framework in JD search, and both offline
experiments and online A/B testing on multiple datasets have validated the
effectiveness of the proposed approach.
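A toy sketch of the condensation step is given below: an offline expert's similarity scores add labels the historical click data never exposed to a query's training set. The embedding-similarity formulation and the threshold are illustrative assumptions; the deployed retrieval pipeline is presumably more involved.

```python
import torch
import torch.nn.functional as F

def condense_labels(query_emb, label_emb, historical_labels, threshold=0.8):
    # query_emb: (num_queries, dim) embeddings from the offline BERT-style expert(s).
    # label_emb: (num_labels, dim) embeddings of the category labels.
    # historical_labels: list of label-id lists observed in the historical data.
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(label_emb, dim=-1).t()
    augmented = []
    for q, scores in enumerate(sims):
        labels = set(historical_labels[q])
        labels |= {int(i) for i in (scores > threshold).nonzero().flatten()}
        augmented.append(sorted(labels))  # expanded label set for FastText training
    return augmented
```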
Stochastic Precision Ensemble: Self-Knowledge Distillation for Quantized Deep Neural Networks
The quantization of deep neural networks (QDNNs) has been actively studied
for deployment in edge devices. Recent studies employ the knowledge
distillation (KD) method to improve the performance of quantized networks. In
this study, we propose stochastic precision ensemble training for QDNNs (SPEQ).
SPEQ is a knowledge distillation training scheme; however, the teacher is
formed by sharing the model parameters of the student network. We obtain the
soft labels of the teacher by changing the bit precision of the activation
stochastically at each layer of the forward-pass computation. The student model
is trained with these soft labels to reduce the activation quantization noise.
The cosine similarity loss is employed, instead of the KL-divergence, for KD
training. Because the teacher model changes continuously through random
bit-precision assignment, it exploits the effect of stochastic ensemble KD. SPEQ
outperforms existing quantization training methods on various tasks, such as image
classification, question answering, and transfer learning, without the need for
cumbersome teacher networks.
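A compact sketch of the stochastic-precision teacher as described in the abstract follows: the teacher shares the student's parameters, each layer's activation is quantized to a randomly chosen bit-width on the forward pass, and a cosine-similarity loss replaces KL divergence. The quantizer, bit-width choices, and layer structure are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def quantize_activation(x, n_bits):
    # Simple uniform quantizer for non-negative activations (illustrative only).
    qmax = 2 ** n_bits - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(0, qmax) * scale

def teacher_soft_labels(layers, x, bit_choices=(2, 4, 8)):
    # The teacher shares the student's parameters; each layer's activation is
    # quantized to a randomly chosen precision on this forward pass.
    with torch.no_grad():
        for layer in layers:
            x = quantize_activation(F.relu(layer(x)), random.choice(bit_choices))
    return x  # logits used as soft labels

def cosine_kd_loss(student_logits, teacher_logits):
    # Cosine-similarity loss used in place of KL divergence for KD training.
    return 1.0 - F.cosine_similarity(student_logits, teacher_logits, dim=-1).mean()
```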
Learning From Biased Soft Labels
Knowledge distillation has been widely adopted in a variety of tasks and has
achieved remarkable successes. Since its inception, many researchers have been
intrigued by the dark knowledge hidden in the outputs of the teacher model.
Recently, a study has demonstrated that knowledge distillation and label
smoothing can be unified as learning from soft labels. Consequently, how to
measure the effectiveness of the soft labels becomes an important question.
Most existing theories impose stringent constraints on the teacher model or the data
distribution, and many of their assumptions imply that the soft labels are close to the
ground-truth labels. This paper studies whether biased soft labels are still
effective. We present two more comprehensive indicators to measure the
effectiveness of such soft labels. Based on the two indicators, we give
sufficient conditions to ensure that learners trained on biased soft labels are
classifier-consistent and ERM learnable. The theory is applied to three
weakly-supervised frameworks. Experimental results validate that biased soft
labels can also teach good students, which corroborates the soundness of the
theory.
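For reference, the unified "learning from soft labels" objective that the abstract builds on reduces to a cross-entropy against soft targets; a minimal sketch is below. The indicators and learnability conditions themselves are theoretical and are not captured by this snippet.

```python
import torch.nn.functional as F

def soft_label_loss(logits, soft_labels):
    # Cross-entropy against (possibly biased) soft labels instead of one-hot targets;
    # this single objective covers both knowledge distillation and label smoothing.
    return -(soft_labels * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```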