Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent
A recent line of empirical studies has demonstrated that SGD might exhibit heavy-tailed behavior in practical settings, and that the heaviness of the tails might correlate with overall performance. In this paper, we investigate the emergence of such heavy tails. Previous works on this problem have, to the best of our knowledge, only considered online (also called single-pass) SGD, for which the emergence of heavy tails in theoretical findings is contingent upon access to an infinite amount of data. Hence, the underlying mechanism generating the reported heavy-tailed behavior in practical settings, where the amount of training data is finite, is still not well understood. Our contribution aims to fill this gap. In particular, we show that the stationary distribution of offline (also called multi-pass) SGD exhibits 'approximate' power-law tails, and that the approximation error is controlled by how fast the empirical distribution of the training data converges to the true underlying data distribution in the Wasserstein metric. Our main takeaway is that, as the number of data points increases, offline SGD behaves increasingly 'power-law-like'. To achieve this result, we first prove nonasymptotic Wasserstein convergence bounds for offline SGD to online SGD as the number of data points increases, which may be of independent interest. Finally, we illustrate our theory with various experiments on synthetic data and neural networks.

Comment: In Neural Information Processing Systems (NeurIPS), Spotlight Presentation, 2023
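To make the online vs. offline distinction concrete, here is a minimal illustrative sketch (not the paper's code): it runs single-pass SGD on freshly drawn samples and multi-pass SGD on a fixed finite training set for a toy linear regression, then applies a crude Hill estimator to the iterate norms as a rough tail-heaviness diagnostic. The model, step size, sample sizes, and helper names (offline_pair, online_pair, hill_tail_index) are all illustrative assumptions.

```python
# Illustrative sketch (assumption, not the paper's code): single-pass vs.
# multi-pass SGD on a toy linear regression, with a crude Hill estimator
# applied to the iterate norms as a rough tail-heaviness diagnostic.
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, iters = 5, 200, 0.1, 50_000   # dimension, dataset size, step size, iterations

# Ground-truth linear model and a finite training set: y_i = <a_i, x*> + noise
x_star = rng.normal(size=d)
A = rng.normal(size=(n, d))
y = A @ x_star + rng.normal(size=n)

def sgd(sample_pair, iters, eta):
    """Run SGD with batch size 1 on 0.5*(<a, x> - b)^2; record iterate norms."""
    x = np.zeros(d)
    norms = np.empty(iters)
    for t in range(iters):
        a, b = sample_pair()
        x = x - eta * (a @ x - b) * a
        norms[t] = np.linalg.norm(x)
    return norms

def offline_pair():
    """Offline (multi-pass) SGD: resample with replacement from the finite set."""
    i = rng.integers(n)
    return A[i], y[i]

def online_pair():
    """Online (single-pass) SGD: draw a fresh sample from the data distribution."""
    a = rng.normal(size=d)
    return a, a @ x_star + rng.normal()

def hill_tail_index(samples, k=500):
    """Crude Hill-type estimate of the power-law tail index from the top-k samples.
    Consecutive iterates are correlated, so this is only a rough diagnostic."""
    s = np.sort(samples)[-k:]
    return 1.0 / np.mean(np.log(s / s[0]))

burn = iters // 2   # discard a burn-in so the chains are close to stationarity
tail_offline = hill_tail_index(sgd(offline_pair, iters, eta)[burn:])
tail_online = hill_tail_index(sgd(online_pair, iters, eta)[burn:])
print("offline (multi-pass) tail index ~", tail_offline)
print("online  (single-pass) tail index ~", tail_online)
```

Heavier tails correspond to smaller estimated indices; in line with the abstract's main takeaway, the offline estimate should move toward the online one as the dataset size n grows, and larger step sizes typically make both distributions heavier-tailed.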