Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with Arbitrary Data Orders
Stochastic Gradient Descent (SGD) algorithms are widely used in optimizing
neural networks. Random Reshuffling (RR), which draws a fresh random
permutation of the training data at each epoch, and Single Shuffle (SS), which
fixes a single permutation and reuses it, are popular choices. However, the
convergence properties of these algorithms in the
non-convex case are not fully understood. Existing results suggest that, in
realistic training scenarios where the number of epochs is smaller than the
training set size, RR may perform worse than SGD.
In this paper, we analyze a general SGD algorithm that allows for arbitrary
data orderings and show improved convergence rates for non-convex functions.
Specifically, our analysis reveals that SGD with random or single shuffling
converges at least as fast as classical SGD with replacement, regardless of
the number of iterations. Overall, our study highlights the benefits of using
SGD with random/single shuffling and provides new insights into its
convergence properties for non-convex optimization.
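To ground the comparison, here is a minimal NumPy sketch of the three data
orders the abstract contrasts. It is an illustration, not the paper's
algorithm; the function name `sgd_epochs` and the toy quadratic components are
assumptions made for the example.

```python
import numpy as np

def sgd_epochs(x0, grad_i, n, epochs, lr, order="rr", seed=0):
    """Run SGD on f(x) = (1/n) * sum_i f_i(x) under three data orders.

    "with_replacement": classical SGD, sampling i uniformly at every step.
    "rr": Random Reshuffling, a fresh permutation at the start of each epoch.
    "ss": Single Shuffle, one permutation drawn once and reused every epoch.
    grad_i(x, i) returns the gradient of the component f_i at x.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    perm = rng.permutation(n)            # fixed order, used as-is by "ss"
    for _ in range(epochs):
        if order == "rr":
            perm = rng.permutation(n)    # redrawn every epoch for "rr"
        for t in range(n):
            i = rng.integers(n) if order == "with_replacement" else perm[t]
            x -= lr * grad_i(x, i)
    return x

# Toy finite sum: f_i(x) = 0.5 * (x - a_i)^2, minimized on average at mean(a).
a = np.array([0.0, 1.0, 2.0, 3.0])
grad = lambda x, i: x - a[i]
for scheme in ("with_replacement", "rr", "ss"):
    print(scheme, sgd_epochs(5.0, grad, n=4, epochs=50, lr=0.05, order=scheme))
```

Note that all three variants take exactly n gradient steps per epoch; they
differ only in which index is visited at each step, which is the quantity the
paper's analysis allows to be arbitrary.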
On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms
The stochastic gradient descent (SGD) algorithm is the method of choice in many
machine learning tasks thanks to its scalability and efficiency in dealing with
large-scale problems. In this paper, we focus on the shuffling version of SGD,
which matches mainstream practical heuristics. We show the convergence to a
global solution of shuffling SGD for a class of non-convex functions under
over-parameterized settings. Our analysis employs more relaxed non-convex
assumptions than the previous literature. Nevertheless, we maintain the same
computational complexity that shuffling SGD has achieved in the general convex
setting.
Comment: The 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
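The "mainstream practical heuristic" the abstract refers to is per-epoch
reshuffling, which is the default behavior of standard training loops. As a
hedged sketch (the toy over-parameterized least-squares model below is an
assumption chosen only so the data can be interpolated), a PyTorch DataLoader
with shuffle=True runs exactly this kind of shuffling SGD:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy over-parameterized problem: 32 weights but only 8 training samples,
# so the linear model can interpolate the data (drive the loss to zero).
X, y = torch.randn(8, 32), torch.randn(8, 1)
model = torch.nn.Linear(32, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# shuffle=True draws a fresh permutation of the dataset at the start of
# every epoch, i.e. the Random Reshuffling flavor of shuffling SGD.
loader = DataLoader(TensorDataset(X, y), batch_size=1, shuffle=True)

for epoch in range(200):
    for xb, yb in loader:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        opt.step()

print("final training loss:", torch.nn.functional.mse_loss(model(X), y).item())
```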
Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond
We study convergence lower bounds of without-replacement stochastic gradient
descent (SGD) for solving smooth (strongly-)convex finite-sum minimization
problems. Unlike most existing results focusing on final iterate lower bounds
in terms of the number of components n and the number of epochs K, we seek
bounds for arbitrary weighted average iterates that are tight in all factors
including the condition number κ. For SGD with Random Reshuffling, we
present lower bounds that have tighter κ dependencies than existing
bounds. Our results are the first to perfectly close the gap between lower and
upper bounds for weighted average iterates in both strongly-convex and convex
cases. We also prove weighted average iterate lower bounds for arbitrary
permutation-based SGD, which apply to all variants that carefully choose the
best permutation. Our bounds improve the existing bounds in factors of K and κ,
and thereby match the upper bounds shown for a recently proposed
algorithm called GraB.
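The central object in these bounds is the weighted average iterate
x̄ = (Σ_k w_k x_k) / (Σ_k w_k) rather than the final iterate. A minimal sketch
of this quantity follows; the helper name and the tail-averaging weights are
illustrative assumptions, since the bounds hold for arbitrary weight choices.

```python
import numpy as np

def weighted_average_iterate(iterates, weights):
    """Compute x_bar = (sum_k w_k * x_k) / (sum_k w_k) over SGD iterates.

    `iterates` holds the parameter vectors x_0, ..., x_K produced by any
    permutation-based SGD run; `weights` assigns a scalar w_k to each x_k.
    """
    W = np.asarray(weights, dtype=float)   # shape (K+1,)
    X = np.stack(iterates)                 # shape (K+1, d)
    return (W[:, None] * X).sum(axis=0) / W.sum()

# Example: uniform averaging over the last half of the iterates ("tail
# averaging"), one common instance of an arbitrary weighted average.
xs = [np.array([1.0 / (k + 1), 2.0 / (k + 1)]) for k in range(10)]
ws = [0.0] * 5 + [1.0] * 5
print(weighted_average_iterate(xs, ws))
```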