The Quest of Finding the Antidote to Sparse Double Descent
In energy-efficient schemes, finding the optimal size of deep learning models
is very important and has a broad impact. Meanwhile, recent studies have
reported an unexpected phenomenon, the sparse double descent: as the model's
sparsity increases, the performance first worsens, then improves, and finally
deteriorates. Such a non-monotonic behavior raises serious questions about the
optimal model's size to maintain high performance: the model needs to be
sufficiently over-parametrized, but having too many parameters wastes training
resources.
In this paper, we aim to find the best trade-off efficiently. More precisely,
we tackle the occurrence of the sparse double descent and present some
solutions to avoid it. First, we show that a simple regularization method can
mitigate the phenomenon, but at the cost of a worse performance/sparsity
trade-off. To overcome this limitation, we then introduce a learning scheme in
which knowledge distillation regularizes the student model. Supported by
experimental results obtained on typical image classification setups, we show
that this approach avoids the sparse double descent.
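A minimal sketch of the kind of distillation loss such a scheme relies on is shown below, in PyTorch. This is the standard Hinton-style formulation, not necessarily the authors' exact recipe; the temperature `T` and mixing weight `alpha` are illustrative values, and the teacher is assumed to be a dense counterpart of the sparsified student.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Combine hard-label cross-entropy with a KL term that pulls the
    student toward the teacher's temperature-softened distribution."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude matches the hard loss
    return alpha * soft + (1.0 - alpha) * hard
```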
Reconciling modern machine learning practice and the bias-variance trade-off
Breakthroughs in machine learning are rapidly changing science and society,
yet our fundamental understanding of this technology has lagged far behind.
Indeed, one of the central tenets of the field, the bias-variance trade-off,
appears to be at odds with the observed behavior of methods used in modern
machine learning practice. The bias-variance trade-off implies that a model
should balance under-fitting and over-fitting: rich enough to express
underlying structure in data, simple enough to avoid fitting spurious patterns.
However, in modern practice, very rich models such as neural networks are
trained to exactly fit (i.e., interpolate) the data. Classically, such models
would be considered over-fit, and yet they often obtain high accuracy on test
data. This apparent contradiction has raised questions about the mathematical
foundations of machine learning and their relevance to practitioners.
In this paper, we reconcile the classical understanding and the modern
practice within a unified performance curve. This "double descent" curve
subsumes the textbook U-shaped bias-variance trade-off curve by showing how
increasing model capacity beyond the point of interpolation results in improved
performance. We provide evidence for the existence and ubiquity of double
descent for a wide spectrum of models and datasets, and we posit a mechanism
for its emergence. This connection between the performance and the structure of
machine learning models delineates the limits of classical analyses, and has
implications for both the theory and practice of machine learning.
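The curve itself is easy to reproduce in a toy setting. The sketch below is our own illustration under arbitrary choices of dimensions and noise level, not the paper's experiments: it traces the test error of a minimum-norm random-features fit as the feature count p sweeps through the interpolation threshold (p equal to the number of training samples).

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, noise = 100, 1000, 20, 0.5

# Ground-truth linear signal observed through noisy labels.
w = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w + noise * rng.normal(size=n_train)
y_te = X_te @ w

for p in [10, 50, 90, 100, 110, 200, 1000]:  # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)  # fixed random projection
    phi_tr, phi_te = np.tanh(X_tr @ W), np.tanh(X_te @ W)
    beta = np.linalg.pinv(phi_tr) @ y_tr  # minimum-norm least squares
    mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"p={p:5d}  test MSE={mse:.3f}")  # error typically peaks near p == n_train
```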
Dropout Drops Double Descent
In this paper, we find, and analyze why, double descent can be dropped simply
by adding one dropout layer before the fully-connected linear layer. The
surprising double-descent phenomenon has drawn public attention in recent
years: the prediction error rises and then drops as either the sample size or
the model size increases. The current paper shows, both theoretically and
empirically, that these phenomena can be alleviated by using optimal dropout
in the linear regression model and in nonlinear random feature regression. We
obtain the optimal dropout hyperparameter by estimating the ground truth with
a generalized ridge-type estimator. Moreover, we empirically show that optimal
dropout can achieve a monotonic test error curve in nonlinear neural networks
on Fashion-MNIST and CIFAR-10. Our results suggest considering dropout when
the risk curve exhibits the peak phenomenon. In addition, we explain why
earlier deep learning models often did not encounter double-descent scenarios:
a standard regularization approach such as dropout was already applied. To the
best of our knowledge, this paper is the first to analyze the relationship
between dropout and double descent.
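The intervention described here is essentially a one-line architectural change. The following is a minimal PyTorch sketch of the general pattern, with layer sizes chosen arbitrarily (a 28x28 Fashion-MNIST-shaped input is assumed), not the paper's exact architecture:

```python
import torch.nn as nn

# Small classifier with a single Dropout layer inserted immediately
# before the final fully-connected linear layer.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512),  # sized for 28x28 Fashion-MNIST images
    nn.ReLU(),
    nn.Dropout(p=0.5),        # the one added dropout layer
    nn.Linear(512, 10),       # fully-connected output layer
)
```

As usual, calling `model.eval()` disables the dropout mask at test time, so the regularization acts only during training.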
Can we avoid Double Descent in Deep Neural Networks?
Finding the optimal size of deep learning models is a timely problem of broad
impact, especially in energy-saving schemes. Very recently, an unexpected
phenomenon, the "double descent", has caught the attention of the deep
learning community: as the model's size grows, the performance first worsens
and then improves again. This raises serious questions about the optimal
model's size to maintain high generalization: the model needs to be
sufficiently over-parametrized, but adding too many parameters wastes training
resources. Is it possible to find the best trade-off efficiently? Our work
shows that the double descent phenomenon is potentially avoidable with proper
conditioning of the learning problem, but a final answer is yet to be found.
We empirically observe that there is hope of dodging the double descent in
complex scenarios with proper regularization, as a simple regularization
already contributes positively in this direction.
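The excerpt leaves the "simple regularization" unspecified; assuming it is a standard L2 penalty (weight decay), one of the most common choices, a minimal PyTorch sketch of the two usual ways to apply it is:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the actual network

# Option 1: let the optimizer apply weight decay (coupled L2).
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# Option 2: add an explicit L2 penalty to the task loss; for plain SGD
# the two coincide up to a constant factor in the coefficient.
def l2_penalty(model, lam=1e-4):
    return lam * sum(p.pow(2).sum() for p in model.parameters())
```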
Label Noise in Adversarial Training: A Novel Perspective to Study Robust Overfitting
We show that label noise exists in adversarial training. Such label noise is
due to the mismatch between the true label distribution of adversarial examples
and the label inherited from clean examples - the true label distribution is
distorted by the adversarial perturbation, but is neglected by the common
practice that inherits labels from clean examples. Recognizing label noise
sheds insights on the prevalence of robust overfitting in adversarial training,
and explains its intriguing dependence on perturbation radius and data quality.
Also, our label noise perspective aligns well with our observations of the
epoch-wise double descent in adversarial training. Guided by our analyses, we
propose a method to automatically calibrate the labels to address the label
noise and robust overfitting. Our method achieves consistent performance
improvements across various models and datasets without introducing new
hyper-parameters or additional tuning.
Comment: NeurIPS 2022 (Oral); a previous version of this paper (v1) used the
title "Double Descent in Adversarial Training: An Implicit Label Noise
Perspective".
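One way to picture the calibration idea is to replace the inherited one-hot label with a softened target when training on adversarial examples. The snippet below is a self-distillation-style stand-in sketch, not the authors' exact method; the temperature `T`, mixing weight `alpha`, and the use of the model's own softened prediction as the calibration source are all assumptions.

```python
import torch
import torch.nn.functional as F

def calibrated_target(model, x_adv, y, T=2.0, alpha=0.5):
    """Blend the inherited one-hot label with the model's own
    temperature-softened prediction on the adversarial example."""
    with torch.no_grad():
        soft = F.softmax(model(x_adv) / T, dim=1)
    onehot = F.one_hot(y, num_classes=soft.size(1)).float()
    return alpha * onehot + (1.0 - alpha) * soft

def calibrated_loss(model, x_adv, y):
    # Cross-entropy against the calibrated (soft) target; x_adv is
    # assumed to come from any standard attack, e.g. PGD.
    target = calibrated_target(model, x_adv, y)
    logp = F.log_softmax(model(x_adv), dim=1)
    return -(target * logp).sum(dim=1).mean()
```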