7 research outputs found
Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation
We explore best practices for training small, memory-efficient machine
translation models with sequence-level knowledge distillation in the domain
adaptation setting. While both domain adaptation and knowledge distillation are
widely-used, their interaction remains little understood. Our large-scale
empirical results in machine translation (on three language pairs with three
domains each) suggest distilling twice for best performance: once using
general-domain data and again using in-domain data with an adapted teacher.
Comment: Accepted to WNGT 2020 Workshop at ACL 2020 Conference. Code is at
http://github.com/mitchellgordon95/kd-au
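Read literally, the recommendation is a two-stage distillation pipeline. The Python sketch below only outlines that recipe under stated assumptions; train, translate, and finetune are hypothetical placeholder callables, not functions from the linked repository.

# Illustrative outline of the "distill twice" recipe for sequence-level KD
# in domain adaptation. All callables passed in are hypothetical placeholders.
def distill_adapt_distill(general_bitext, in_domain_bitext, train, translate, finetune):
    # 1) Train a large general-domain teacher on parallel data.
    teacher = train(general_bitext, size="large")
    # 2) First distillation: small student trained on teacher translations
    #    of the general-domain source sentences (sequence-level KD).
    general_kd = [(src, translate(teacher, src)) for src, _ in general_bitext]
    student = train(general_kd, size="small")
    # 3) Adapt the teacher to the target domain.
    adapted_teacher = finetune(teacher, in_domain_bitext)
    # 4) Second distillation: continue training the student on in-domain
    #    sources labeled by the adapted teacher.
    in_domain_kd = [(src, translate(adapted_teacher, src)) for src, _ in in_domain_bitext]
    student = finetune(student, in_domain_kd)
    return student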
Self-Adaptive Training: beyond Empirical Risk Minimization
We propose self-adaptive training---a new training algorithm that dynamically
corrects problematic training labels by model predictions without incurring
extra computational cost---to improve generalization of deep learning for
potentially corrupted training data. This problem is crucial towards robustly
learning from data that are corrupted by, e.g., label noise and
out-of-distribution samples. The standard empirical risk minimization (ERM) for
such data, however, may easily overfit the noise and thus suffer from sub-optimal
performance. In this paper, we observe that model predictions can substantially
benefit the training process: self-adaptive training significantly improves
generalization over ERM under various levels of noise, and mitigates the
overfitting issue in both natural and adversarial training. We evaluate the
error-capacity curve of self-adaptive training: the test error is monotonically
decreasing w.r.t. model capacity. This is in sharp contrast to the
recently-discovered double-descent phenomenon in ERM, which might be a result of
overfitting noise. Experiments on CIFAR and ImageNet datasets verify the
effectiveness of our approach in two applications: classification with label
noise and selective classification. We release our code at
https://github.com/LayneH/self-adaptive-training.
Comment: To appear in NeurIPS 2020
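A minimal sketch of the central idea, i.e. training targets that are gradually corrected by the model's own predictions. The exponential-moving-average form and the momentum value below are assumptions made for illustration, not necessarily the paper's exact update rule.

import torch
import torch.nn.functional as F

def self_adaptive_loss(logits, soft_targets, momentum=0.9):
    # soft_targets start as one-hot labels; each step moves them toward the
    # model's current predictions, so confidently mislabeled examples are
    # corrected over training (EMA form and momentum value are illustrative).
    probs = F.softmax(logits.detach(), dim=1)
    soft_targets.mul_(momentum).add_(probs, alpha=1.0 - momentum)
    # Cross-entropy against the corrected (soft) targets.
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()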
Distilling Double Descent
Distillation is the technique of training a "student" model based on examples
that are labeled by a separate "teacher" model, which itself is trained on a
labeled dataset. The most common explanations for why distillation "works" are
predicated on the assumption that the student is provided with soft labels,
e.g. probabilities or confidences, from the teacher model. In this work, we
show that, even when the teacher model is highly overparameterized and
provides hard labels, using a very large held-out unlabeled dataset to
train the student model can result in a model that outperforms more
"traditional" approaches.
Our explanation for this phenomenon is based on recent work on "double
descent". It has been observed that, once a model's complexity roughly exceeds
the amount required to memorize the training data, increasing the complexity
further can, counterintuitively, result in better generalization.
Researchers have identified several settings in which double descent takes place,
while others have made various attempts to explain it (thus far, with only partial
success). In contrast, we avoid these questions and instead seek to
exploit this phenomenon by demonstrating that a highly overparameterized
teacher can avoid overfitting via double descent, while a student trained on a
larger independent dataset labeled by this teacher will avoid overfitting due
to the size of its training set.
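As a concrete (hypothetical) instance of the setup described above: an overparameterized teacher interpolates a small labeled set, then hard-labels a much larger unlabeled pool on which the student is trained. Model classes and layer sizes in this sketch are illustrative choices, not the paper's experimental configuration.

from sklearn.neural_network import MLPClassifier

def distill_with_hard_labels(X_labeled, y_labeled, X_unlabeled):
    # Overparameterized teacher fit on the small labeled set (may interpolate it).
    teacher = MLPClassifier(hidden_layer_sizes=(4096, 4096), max_iter=2000)
    teacher.fit(X_labeled, y_labeled)
    # Hard labels only -- no probabilities or confidences are passed on.
    pseudo_labels = teacher.predict(X_unlabeled)
    # Student trained on the large pseudo-labeled pool; in this account it is
    # the size of that training set that limits overfitting.
    student = MLPClassifier(hidden_layer_sizes=(1024, 1024), max_iter=2000)
    student.fit(X_unlabeled, pseudo_labels)
    return student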
Distillation ≈ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network
Distillation is a method to transfer knowledge from one model to another and
often achieves higher accuracy with the same capacity. In this paper, we aim to
provide a theoretical understanding of what mainly helps with distillation.
Our answer is "early stopping". Assuming that the teacher network is
overparameterized, we argue that the teacher network is essentially harvesting
dark knowledge from the data via early stopping. This can be justified by a new
concept, Anisotropic Information Retrieval (AIR), which means that the neural
network tends to fit the informative content first and the non-informative
content (including noise) later. Motivated by recent developments in
theoretically analyzing overparameterized neural networks, we can characterize
AIR by the eigenspace of the Neural Tangent Kernel (NTK). AIR facilitates a new
understanding of distillation. With that, we further utilize distillation to
refine noisy labels. We propose a self-distillation algorithm to sequentially
distill knowledge from the network in the previous training epoch to avoid
memorizing the wrong labels. We also demonstrate, both theoretically and
empirically, that self-distillation can benefit from more than just early
stopping. Theoretically, we prove convergence of the proposed algorithm to the
ground truth labels for randomly initialized overparameterized neural networks
in terms of $\ell_2$ distance, while the previous result was on convergence in
$0$-$1$ loss. The theoretical result ensures that the learned neural network enjoys a
margin on the training data, which leads to better generalization. Empirically,
we achieve better testing accuracy and entirely avoid early stopping, which
makes the algorithm more user-friendly.
Comment: Accepted by NeurIPS 2019 Workshop on Machine Learning with
Guarantees. Submitted to other place
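One way to read the proposed self-distillation algorithm is the loop sketched below: each round's targets come from a frozen copy of the previous round's network, so the current network keeps fitting the informative signal before the noisy labels are memorized. The per-round granularity and the mixing weight alpha (set alpha=0 to drop the noisy labels entirely) are assumptions of this sketch, not the paper's exact procedure.

import copy
import torch
import torch.nn.functional as F

def sequential_self_distillation(model, loader, optimizer, rounds=5, alpha=0.5):
    for _ in range(rounds):
        previous = copy.deepcopy(model).eval()   # teacher = last round's network
        for inputs, noisy_labels in loader:
            with torch.no_grad():
                soft = F.softmax(previous(inputs), dim=1)
            logits = model(inputs)
            # Mix the previous round's soft predictions with the (possibly noisy)
            # original labels; the weight alpha is an illustrative choice.
            kd_term = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
            loss = alpha * F.cross_entropy(logits, noisy_labels) + (1 - alpha) * kd_term
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model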
Why distillation helps: a statistical perspective
Knowledge distillation is a technique for improving the performance of a
simple "student" model by replacing its one-hot training labels with a
distribution over labels obtained from a complex "teacher" model. While this
simple approach has proven widely effective, a basic question remains
unresolved: why does distillation help? In this paper, we present a statistical
perspective on distillation which addresses this question, and provides a novel
connection to extreme multiclass retrieval techniques. Our core observation is
that the teacher seeks to estimate the underlying (Bayes) class-probability
function. Building on this, we establish a fundamental bias-variance tradeoff
in the student's objective: this quantifies how approximate knowledge of these
class-probabilities can significantly aid learning. Finally, we show how
distillation complements existing negative mining techniques for extreme
multiclass retrieval, and propose a unified objective which combines these
ideas.
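The objective being analyzed is the usual distillation loss, in which the one-hot targets are (partially) replaced by the teacher's class-probability estimates. The sketch below shows only that standard objective; the mixing weight is illustrative, and the paper's unified objective for extreme retrieval (with negative mining) is not reproduced here.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, labels, weight=0.5):
    # Cross-entropy against the teacher's (approximate Bayes) class probabilities...
    soft_term = -(teacher_probs * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
    # ...optionally mixed with the ordinary one-hot cross-entropy.
    hard_term = F.cross_entropy(student_logits, labels)
    return weight * soft_term + (1 - weight) * hard_term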
Self-Distillation Amplifies Regularization in Hilbert Space
Knowledge distillation introduced in the deep learning context is a method to
transfer knowledge from one architecture to another. In particular, when the
architectures are identical, this is called self-distillation. The idea is to
feed in predictions of the trained model as new target values for retraining
(and iterate this loop possibly a few times). It has been empirically observed
that the self-distilled model often achieves higher accuracy on held out data.
Why this happens, however, has been a mystery: the self-distillation dynamics
does not receive any new information about the task and solely evolves by
looping over training. To the best of our knowledge, there is no rigorous
understanding of why this happens. This work provides the first theoretical
analysis of self-distillation. We focus on fitting a nonlinear function to
training data, where the model space is Hilbert space and fitting is subject to
L2 regularization in this function space. We show that self-distillation
iterations modify regularization by progressively limiting the number of basis
functions that can be used to represent the solution. This implies (as we also
verify empirically) that while a few rounds of self-distillation may reduce
over-fitting, further rounds may lead to under-fitting and thus worse
performance.
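The analyzed setting can be mimicked in a few lines of numpy: kernel ridge regression that is repeatedly refit on its own predictions. Each round multiplies the component along the i-th kernel eigendirection by roughly lambda_i / (lambda_i + c), which is the progressive restriction of usable basis functions described above. The fixed regularization constant per round is a simplification of the paper's setup, where it is chosen adaptively.

import numpy as np

def self_distilled_kernel_ridge(K, y, c=1e-1, rounds=3):
    # K: (n, n) kernel Gram matrix, y: (n,) training targets.
    n = K.shape[0]
    targets = y.astype(float).copy()
    alpha = np.zeros(n)
    for _ in range(rounds):
        # L2-regularized fit in the RKHS (dual / kernel ridge solution).
        alpha = np.linalg.solve(K + c * np.eye(n), targets)
        # The next round is trained on this round's predictions, not on y.
        targets = K @ alpha
    return alpha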
When Does Preconditioning Help or Hurt Generalization?
While second-order optimizers such as natural gradient descent (NGD) often
speed up optimization, their effect on generalization has been called into
question. This work presents a more nuanced view on how the implicit
bias of first- and second-order methods affects the comparison of
generalization properties. We provide an exact asymptotic bias-variance
decomposition of the generalization error of overparameterized ridgeless
regression under a general class of preconditioners $\mathbf{P}$, and
consider the inverse population Fisher information matrix (used in NGD) as a
particular example. We determine the optimal $\mathbf{P}$ for both the bias
and variance, and find that the relative generalization performance of
different optimizers depends on the label noise and the "shape" of the signal
(true parameters): when the labels are noisy, the model is misspecified, or the
signal is misaligned with the features, NGD can achieve lower risk; conversely,
GD generalizes better than NGD under clean labels, a well-specified model, or
aligned signal. Based on this analysis, we discuss several approaches to manage
the bias-variance tradeoff, and the potential benefit of interpolating between
GD and NGD. We then extend our analysis to regression in the reproducing kernel
Hilbert space and demonstrate that preconditioned GD can decrease the
population risk faster than GD. Lastly, we empirically compare the
generalization error of first- and second-order optimizers in neural network
experiments, and observe robust trends matching our theoretical analysis.
Comment: 42 pages
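The object of study is preconditioned gradient descent, $\theta \leftarrow \theta - \eta \mathbf{P} \nabla L(\theta)$, on (ridgeless) least squares; $\mathbf{P} = I$ recovers GD, and $\mathbf{P}$ close to an inverse Fisher/covariance mimics NGD. The numpy sketch below is illustrative only; the step size, iteration count, and the particular estimate of $\mathbf{P}$ are assumptions, not the paper's setup.

import numpy as np

def preconditioned_gd(X, y, P, lr=0.1, steps=5000):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n     # least-squares gradient
        theta -= lr * P @ grad               # preconditioned update
    return theta

# Example preconditioners (illustrative):
#   P_gd  = np.eye(d)                                       # plain GD
#   P_ngd = np.linalg.inv(X.T @ X / n + 1e-6 * np.eye(d))   # NGD-like (estimated inverse Fisher)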