Knowledge Transfer with Jacobian Matching
Classical distillation methods transfer representations from a "teacher"
neural network to a "student" network by matching their output activations.
Recent methods also match the Jacobians, i.e., the gradients of the output
activations with respect to the input. However, this involves making some ad
hoc decisions, in particular, the choice of the loss function.
In this paper, we first establish an equivalence between Jacobian matching
and distillation with input noise, from which we derive appropriate loss
functions for Jacobian matching. We then rely on this analysis to apply
Jacobian matching to transfer learning by establishing equivalence of a recent
transfer learning procedure to distillation.
We then show experimentally on standard image datasets that Jacobian-based
penalties improve distillation, robustness to noisy inputs, and transfer
learning.
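
As a rough illustration of the idea described above, the following sketch adds a Jacobian penalty to a plain activation-matching loss. It is not the paper's formulation: the one-layer tanh network, the squared-error and squared Frobenius-norm penalties, the weight `lam`, and all function names are assumptions chosen so that the Jacobian can be written analytically in NumPy.

```python
import numpy as np

def forward(W, x):
    """Toy one-layer network (an assumption for illustration): f(x) = tanh(W x)."""
    return np.tanh(W @ x)

def jacobian(W, x):
    """Analytic Jacobian df/dx of the toy network, shape (out_dim, in_dim)."""
    y = np.tanh(W @ x)
    # d tanh(Wx)_i / dx_j = (1 - y_i^2) * W_ij
    return (1.0 - y ** 2)[:, None] * W

def jacobian_matching_loss(W_student, W_teacher, x, lam=1.0):
    """Squared-error activation matching plus a squared Frobenius Jacobian penalty.

    `lam` is a hypothetical trade-off weight, not a value from the paper.
    """
    out_diff = forward(W_student, x) - forward(W_teacher, x)
    jac_diff = jacobian(W_student, x) - jacobian(W_teacher, x)
    return float(np.sum(out_diff ** 2) + lam * np.sum(jac_diff ** 2))

# Random teacher, student, and input, purely for demonstration.
rng = np.random.default_rng(0)
W_t = rng.normal(size=(3, 4))
W_s = rng.normal(size=(3, 4))
x = rng.normal(size=4)

loss_same = jacobian_matching_loss(W_t, W_t, x)   # identical networks
loss_diff = jacobian_matching_loss(W_s, W_t, x)   # mismatched networks
```

In a real setting the Jacobians would come from automatic differentiation rather than a hand-written formula, and the appropriate loss is exactly the choice the abstract says the paper derives.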
Conditional Teacher-Student Learning
Teacher-student (T/S) learning has been shown to be effective for a
variety of problems such as domain adaptation and model compression. One
shortcoming of T/S learning is that the teacher model, which is not always
perfect, sporadically produces wrong guidance in the form of posterior
probabilities that mislead the student model towards suboptimal performance.
To overcome this
problem, we propose a conditional T/S learning scheme, in which a "smart"
student model selectively chooses to learn from either the teacher model or the
ground truth labels conditioned on whether the teacher can correctly predict
the ground truth. Unlike a naive linear combination of the two knowledge
sources, conditional learning relies exclusively on the teacher model
when the teacher model's prediction is correct, and otherwise backs off to the
ground truth. Thus, the student model is able to learn effectively from the
teacher and even potentially surpass the teacher. We examine the proposed
learning scheme on two tasks: domain adaptation on the CHiME-3 dataset and
speaker adaptation on the Microsoft short message dictation dataset. The
proposed method achieves 9.8% and 12.8% relative word error rate reductions,
respectively, over T/S learning for environment adaptation and over the
speaker-independent model for speaker adaptation.
Comment: 5 pages, 1 figure, ICASSP 201
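
The selection rule in the abstract above can be sketched as follows: for each sample, the training target is the teacher's posterior if the teacher's top prediction matches the label, and the one-hot label otherwise. This is a minimal sketch under assumptions, not the authors' implementation; the function names, the per-sample granularity, and the cross-entropy objective are illustrative choices.

```python
import numpy as np

def conditional_targets(teacher_post, labels):
    """Pick per-sample targets: teacher posteriors if the teacher is right, else one-hot labels.

    teacher_post: array of shape (n_samples, n_classes), rows sum to 1.
    labels: integer class indices of shape (n_samples,).
    """
    teacher_post = np.asarray(teacher_post, dtype=float)
    labels = np.asarray(labels)
    one_hot = np.eye(teacher_post.shape[1])[labels]
    teacher_correct = teacher_post.argmax(axis=1) == labels
    # Broadcast the per-sample condition across the class dimension.
    return np.where(teacher_correct[:, None], teacher_post, one_hot)

def cross_entropy(student_post, targets, eps=1e-12):
    """Mean cross-entropy between (soft or hard) targets and student posteriors."""
    student_post = np.asarray(student_post, dtype=float)
    return float(-np.mean(np.sum(targets * np.log(student_post + eps), axis=1)))

# Hypothetical toy batch: the teacher is right on sample 0 and wrong on sample 1.
teacher_post = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.6, 0.3]])
labels = np.array([0, 2])
targets = conditional_targets(teacher_post, labels)

student_post = np.array([[0.6, 0.3, 0.1],
                         [0.2, 0.3, 0.5]])
loss = cross_entropy(student_post, targets)
```

Unlike a fixed linear interpolation of the two sources, each sample here draws on exactly one of them, which matches the "exclusive" behavior the abstract describes.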