Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models
In this work, we construct generalization bounds to understand existing learning algorithms and propose new ones. Generalization bounds relate empirical performance to future expected performance. The tightness of these bounds varies widely and depends on the complexity of the learning task and the amount of data available, but also on how much information the bounds take into consideration. We are particularly concerned with data- and algorithm-dependent bounds that are quantitatively nonvacuous. We begin with an analysis of stochastic gradient descent (SGD) in supervised learning. By formalizing the notion of flat minima using PAC-Bayes generalization bounds, we obtain nonvacuous generalization bounds for stochastic classifiers based on SGD solutions. Despite strong empirical performance in many settings, SGD rapidly overfits in others. By combining nonvacuous generalization bounds and structural risk minimization, we arrive at an algorithm that trades off accuracy and generalization guarantees. We also study generalization in the context of unsupervised learning. We propose to use a two-sample test statistic for training neural network generator models and bound the gap between the population and the empirical estimate of the statistic.
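For reference, a standard PAC-Bayes bound of the kind underlying this work (the Langford–Seeger form; the exact logarithmic term varies between statements) relates the empirical risk of a stochastic classifier with posterior $Q$ to its true risk, given a prior $P$ fixed before seeing the data:

```latex
% Langford--Seeger PAC-Bayes bound (constants vary across statements):
% with probability at least 1 - \delta over an i.i.d. sample S of size n,
% simultaneously for all posteriors Q,
\mathrm{kl}\!\left(\hat{R}_S(Q) \,\middle\|\, R(Q)\right)
  \le \frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{n},
\qquad
\mathrm{kl}(q \,\|\, p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}.
```

Nonvacuous bounds of this type are typically obtained by optimizing $Q$, for example a Gaussian centered at an SGD solution whose variance encodes flatness, trading empirical risk against the $\mathrm{KL}(Q \,\|\, P)$ complexity term.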
Deep Learning on a Data Diet: Finding Important Examples Early in Training
The recent success of deep learning has partially been driven by training
increasingly overparametrized networks on ever larger datasets. It is therefore
natural to ask: how much of the data is superfluous, which examples are
important for generalization, and how do we find them? In this work, we make
the striking observation that, on standard vision benchmarks, the initial loss
gradient norm of individual training examples, averaged over several weight
initializations, can be used to identify a smaller set of training data that is
important for generalization. Furthermore, after only a few epochs of training,
the information in gradient norms is reflected in the normed error--L2 distance
between the predicted probabilities and one-hot labels--which can be used to
prune a significant fraction of the dataset without sacrificing test accuracy.
Based on this, we propose data pruning methods which use only local information
early in training, and connect them to recent work that prunes data by
discarding examples that are rarely forgotten over the course of training. Our
methods also shed light on how the underlying data distribution shapes the
training dynamics: they rank examples based on their importance for
generalization, detect noisy examples and identify subspaces of the model's
data representation that are relatively stable over training.
Comment: 18 pages, 16 figures
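The normed-error (EL2N) score described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up predictions, not the paper's implementation; in practice the probabilities come from a network after a few epochs of training and scores are averaged over several runs:

```python
import numpy as np

def el2n_scores(probs, labels, num_classes):
    """EL2N score: L2 distance between predicted probabilities and
    one-hot labels, per example (higher = harder / more important)."""
    one_hot = np.eye(num_classes)[labels]
    return np.linalg.norm(probs - one_hot, axis=1)

# Toy predictions (illustrative values, not real model outputs).
probs = np.array([[0.9, 0.05, 0.05],   # confident and correct -> low score
                  [0.4, 0.3, 0.3]])    # uncertain -> high score
labels = np.array([0, 0])
scores = el2n_scores(probs, labels, num_classes=3)

# Prune by keeping only the highest-scoring fraction of the data.
keep_frac = 0.5
keep = np.argsort(scores)[-int(len(scores) * keep_frac):]
```

Here the uncertain example receives the larger score and survives pruning, matching the intuition that easy, confidently-fit examples are the most superfluous.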
Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias
Neural networks trained with (stochastic) gradient descent have an inductive
bias towards learning simpler solutions. This makes them highly prone to
learning simple spurious features that are highly correlated with a label
instead of the predictive but more complex core features. In this work, we show
that, interestingly, the simplicity bias of gradient descent can be leveraged
to identify spurious correlations early in training. First, we prove, for a
two-layer neural network, that groups of examples with high spurious
correlation are separable based on the model's output in the initial training
iterations. We further show that if spurious features have a small enough
noise-to-signal ratio, the network's output on the majority of examples in a
class will be almost exclusively determined by the spurious features and will
be nearly invariant to the core feature. Finally, we propose SPARE, which
separates large groups with spurious correlations early in training, and
utilizes importance sampling to alleviate the spurious correlation by
balancing the group sizes. We show that SPARE achieves up to 5.6% higher
worst-group accuracy than state-of-the-art methods, while being up to 12x
faster. We also show the applicability of SPARE to discover and mitigate
spurious correlations in Restricted ImageNet.
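The core mechanism can be sketched under simplifying assumptions (this is not the authors' implementation: a crude threshold on a scalar early-training output stands in for the paper's clustering step, and all names are illustrative):

```python
import numpy as np

def spare_like_weights(early_outputs):
    """Separate examples into two groups by their early-training model
    output (thresholding at the mean as a stand-in for clustering), then
    assign importance-sampling weights inversely proportional to group
    size so that the minority group is upsampled."""
    groups = (early_outputs > early_outputs.mean()).astype(int)
    sizes = np.bincount(groups, minlength=2)
    weights = 1.0 / sizes[groups]
    return groups, weights / weights.sum()

# Toy outputs: four examples dominated by a spurious feature score high,
# one minority example without the spurious feature scores low.
outputs = np.array([0.9, 0.95, 0.88, 0.92, 0.1])
groups, w = spare_like_weights(outputs)
```

The minority example lands in its own group and receives a larger sampling weight, so subsequent training draws it more often and the group sizes are effectively balanced.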