The recent success of deep learning has partially been driven by training
increasingly overparametrized networks on ever larger datasets. It is therefore
natural to ask: how much of the data is superfluous, which examples are
important for generalization, and how do we find them? In this work, we make
the striking observation that, on standard vision benchmarks, the initial loss
gradient norm of individual training examples, averaged over several weight
initializations, can be used to identify a smaller set of training data that is
important for generalization.
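A minimal sketch of such a gradient-norm score, assuming a PyTorch-style classifier; `make_model`, `examples`, and `n_inits` are illustrative names rather than the paper's released code:

```python
import torch
import torch.nn.functional as F

def grad_norm_scores(make_model, examples, n_inits=10):
    """Average per-example loss-gradient norm over several random
    initializations. `make_model` is a hypothetical factory that
    returns a freshly initialized classifier; `examples` is a list
    of (input_tensor, label) pairs in fixed dataset order."""
    scores = torch.zeros(len(examples))
    for _ in range(n_inits):
        model = make_model()  # new random weight initialization
        for i, (x, y) in enumerate(examples):
            model.zero_grad()
            loss = F.cross_entropy(model(x.unsqueeze(0)), torch.tensor([y]))
            loss.backward()
            # Squared norm of the loss gradient over all parameters,
            # accumulated for this example across initializations.
            sq_norm = sum((p.grad ** 2).sum() for p in model.parameters())
            scores[i] += sq_norm.sqrt().item()
    return scores / n_inits
```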
Furthermore, after only a few epochs of training, the information in gradient norms is reflected in the normed error (the L2 distance between the predicted probabilities and the one-hot labels), which can be used to prune a significant fraction of the dataset without sacrificing test accuracy.
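A sketch of this error-based score and the corresponding pruning step, again assuming a PyTorch-style setup; `num_classes`, `frac_prune`, and the helper names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def error_norm_scores(model, loader, num_classes):
    """L2 distance between predicted class probabilities and one-hot
    labels, computed with a model trained for a few epochs. The
    loader must iterate in fixed dataset order (no shuffling) so the
    returned scores align with dataset indices."""
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in loader:
            probs = F.softmax(model(x), dim=1)
            onehot = F.one_hot(y, num_classes).float()
            scores.append((probs - onehot).norm(dim=1))
    return torch.cat(scores)

def keep_high_score_indices(scores, frac_prune):
    """Drop the lowest-scoring fraction of the training set and
    return the indices of the examples to keep."""
    n_keep = int(len(scores) * (1 - frac_prune))
    return scores.argsort(descending=True)[:n_keep]
```

Keeping the highest-scoring examples follows the claim above that a significant low-scoring fraction can be discarded without hurting test accuracy.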
Based on this, we propose data pruning methods that use only local information early in training, and connect them to recent work that prunes data by discarding examples that are rarely forgotten over the course of training (sketched below). Our
methods also shed light on how the underlying data distribution shapes the
training dynamics: they rank examples based on their importance for
generalization, detect noisy examples, and identify subspaces of the model's
data representation that are relatively stable over training.
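For comparison, the forgetting criterion referenced above can be sketched as counting, per example, transitions from correct to incorrect classification across training epochs; the boolean history tensor is an assumed input format, not the cited work's code:

```python
import torch

def forgetting_counts(correct_history):
    """Count forgetting events per example.

    `correct_history` is an (n_epochs, n_examples) boolean tensor of
    per-epoch correctness. A forgetting event is a transition from
    correct at epoch t to incorrect at epoch t + 1; examples with low
    counts are rarely forgotten and are candidates for pruning."""
    prev, curr = correct_history[:-1], correct_history[1:]
    return (prev & ~curr).sum(dim=0)
```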