Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD
When using Stochastic Gradient Descent (SGD) for training machine learning
models, it is often crucial to provide the model with examples sampled at
random from the dataset. However, for large datasets stored in the cloud,
random access to individual examples is often costly and inefficient. A recent
work \cite{corgi} proposed an online shuffling algorithm called CorgiPile,
which greatly improves the efficiency of data access at the cost of some
performance loss; this loss is particularly apparent for large datasets stored in homogeneous
shards (e.g., video datasets). In this paper, we introduce a novel two-step
partial data shuffling strategy for SGD which combines an offline iteration of
the CorgiPile method with a subsequent online iteration. Our approach enjoys
the best of both worlds: it performs similarly to SGD with random access (even
for homogeneous data) without compromising the data access efficiency of
CorgiPile. We provide a comprehensive theoretical analysis of the convergence
properties of our method and demonstrate its practical advantages through
experimental results.
Comment: 19 pages, 5 figures
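The following is a minimal, hypothetical sketch of how such a two-step partial shuffle could be structured: an offline CorgiPile-style pass that reads a few shards at a time, shuffles them together, and writes them back as new shards, followed by the usual online buffer shuffle during training. Function names, shard layout, and buffer sizes are illustrative and not taken from the paper.

```python
# Hypothetical sketch of a two-step partial shuffle; not the authors' implementation.
import random

def offline_partial_shuffle(shards, buffer_shards, rng):
    """Offline pass: read a few shards at a time, shuffle them together,
    and write the result back out as new shards (no full random access)."""
    order = list(range(len(shards)))
    rng.shuffle(order)
    size = len(shards[0])
    new_shards = []
    for start in range(0, len(order), buffer_shards):
        buffer = [ex for i in order[start:start + buffer_shards] for ex in shards[i]]
        rng.shuffle(buffer)
        # split the shuffled buffer back into shard-sized pieces
        new_shards += [buffer[k:k + size] for k in range(0, len(buffer), size)]
    return new_shards

def online_buffer_shuffle(shards, buffer_shards, rng):
    """Online pass: stream shards in random order, shuffling a small in-memory buffer."""
    order = list(range(len(shards)))
    rng.shuffle(order)
    for start in range(0, len(order), buffer_shards):
        buffer = [ex for i in order[start:start + buffer_shards] for ex in shards[i]]
        rng.shuffle(buffer)            # only the buffer is shuffled, not the full dataset
        yield from buffer

rng = random.Random(0)
shards = [[f"s{i}_ex{j}" for j in range(4)] for i in range(6)]   # toy homogeneous shards
shards = offline_partial_shuffle(shards, buffer_shards=2, rng=rng)  # one-time offline pass
for example in online_buffer_shuffle(shards, buffer_shards=2, rng=rng):
    pass  # feed `example` to the SGD training loop
```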
On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms
The stochastic gradient descent (SGD) algorithm is the method of choice in many
machine learning tasks thanks to its scalability and efficiency in dealing with
large-scale problems. In this paper, we focus on the shuffling version of SGD
which matches mainstream practical heuristics. We show convergence to a
global solution of shuffling SGD for a class of non-convex functions under
over-parameterized settings. Our analysis employs more relaxed non-convex
assumptions than previous literature. Nevertheless, we maintain the desired
computational complexity as shuffling SGD has achieved in the general convex
setting.
Comment: The 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
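As a point of reference, here is a minimal sketch of the shuffling version of SGD (one fresh permutation per epoch, then a pass over the data without replacement), illustrated on a toy least-squares problem; the over-parameterized non-convex setting analyzed in the paper is not reproduced here.

```python
# Minimal sketch of shuffling-type SGD (random reshuffling each epoch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))
y = X @ rng.normal(size=5) + 0.01 * rng.normal(size=128)
w = np.zeros(5)
lr = 0.01

for epoch in range(20):
    perm = rng.permutation(len(X))        # reshuffle once per epoch
    for i in perm:                        # then pass over the data without replacement
        grad = (X[i] @ w - y[i]) * X[i]   # per-example gradient of 0.5 * (x_i^T w - y_i)^2
        w -= lr * grad
```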
Uniformly convex neural networks and non-stationary iterated network Tikhonov (iNETT) method
We propose a non-stationary iterated network Tikhonov (iNETT) method for the
solution of ill-posed inverse problems. The iNETT employs deep neural networks
to build a data-driven regularizer, and it avoids the difficult task of
estimating the optimal regularization parameter. To achieve the theoretical
convergence of iNETT, we introduce uniformly convex neural networks to build
the data-driven regularizer. Rigorous theories and detailed algorithms are
proposed for the construction of convex and uniformly convex neural networks.
In particular, given a general neural network architecture, we prescribe
sufficient conditions to achieve a trained neural network which is
component-wise convex or uniformly convex; moreover, we provide concrete
examples of realizing convexity and uniform convexity in the modern U-net
architecture. With the tools of convex and uniformly convex neural networks,
the iNETT algorithm is developed and a rigorous convergence analysis is
provided. Lastly, we show applications of the iNETT algorithm in 2D
computerized tomography, where numerical examples illustrate the efficacy of
the proposed algorithm.
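For intuition, one standard way to obtain a network that is component-wise convex in its input is to combine convex, non-decreasing activations with non-negative weights in all layers after the first. The sketch below illustrates only this general principle; it is not the paper's specific sufficient conditions or its U-net construction.

```python
# Illustrative sketch of a component-wise convex network (not the paper's construction).
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)          # first layer may be signed
W2, b2 = np.abs(rng.normal(size=(3, 16))), rng.normal(size=3)   # later weights kept >= 0

def convex_net(x):
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU of an affine map: convex in x
    return W2 @ h + b2                 # non-negative combination preserves convexity

# Sanity check of convexity along a segment: f(midpoint) <= average of the endpoints.
x0, x1 = rng.normal(size=4), rng.normal(size=4)
assert np.all(convex_net(0.5 * (x0 + x1)) <= 0.5 * (convex_net(x0) + convex_net(x1)) + 1e-9)
```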
High Probability Guarantees for Random Reshuffling
We consider the stochastic gradient method with random reshuffling (RR) for
tackling smooth nonconvex optimization problems. RR finds broad applications
in practice, notably in training neural networks. In this work, we first
investigate the concentration property of RR's sampling procedure and
establish a new high probability sample complexity guarantee for driving the
gradient (without expectation) below $\epsilon$, which effectively
characterizes the efficiency of a single RR execution. Our derived complexity
matches the best existing in-expectation one up to a logarithmic term while
imposing no additional assumptions nor changing RR's updating rule.
Furthermore, by leveraging our derived high probability descent property and
bound on the stochastic error, we propose a simple and computable stopping
criterion for RR (denoted as RR-sc). This criterion is guaranteed to be
triggered after a finite number of iterations, and then RR-sc returns an
iterate with its gradient below $\epsilon$ with high probability. Moreover,
building on the proposed stopping criterion, we design a perturbed random
reshuffling method (p-RR) that involves an additional randomized perturbation
procedure near stationary points. We derive that p-RR provably escapes strict
saddle points and efficiently returns a second-order stationary point with
high probability, without making any sub-Gaussian tail-type assumptions on the
stochastic gradient errors. Finally, we conduct numerical experiments on
neural network training to support our theoretical findings.
Comment: 21 pages, 3 figures
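A minimal sketch of random reshuffling with a computable stopping rule follows; the proxy it uses (the norm of the gradient averaged over the epoch) is purely illustrative and is not the paper's RR-sc criterion or its high-probability guarantee.

```python
# Illustrative sketch: random reshuffling with a simple epoch-level stopping rule.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10)
w, lr, tol = np.zeros(10), 0.005, 1e-3

for epoch in range(1000):
    perm = rng.permutation(len(X))
    epoch_grad = np.zeros_like(w)
    for i in perm:
        g = (X[i] @ w - y[i]) * X[i]
        epoch_grad += g / len(X)           # running average of the gradients seen this epoch
        w -= lr * g
    if np.linalg.norm(epoch_grad) < tol:   # stop once the epoch-averaged gradient is small
        break
```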
Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with Arbitrary Data Orders
Stochastic Gradient Descent (SGD) algorithms are widely used in optimizing
neural networks, with Random Reshuffling (RR) and Single Shuffle (SS) being
popular choices for cycling through random or single permutations of the
training data. However, the convergence properties of these algorithms in the
non-convex case are not fully understood. Existing results suggest that, in
realistic training scenarios where the number of epochs is smaller than the
training set size, RR may perform worse than SGD.
In this paper, we analyze a general SGD algorithm that allows for arbitrary
data orderings and show improved convergence rates for non-convex functions.
Specifically, our analysis reveals that SGD with random and single shuffling is
always faster or at least as good as classical SGD with replacement, regardless
of the number of iterations. Overall, our study highlights the benefits of
using SGD with random/single shuffling and provides new insights into its
convergence properties for non-convex optimization.
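To make the compared orderings concrete, the sketch below generates the index streams for with-replacement SGD, single shuffle (SS), and random reshuffling (RR); the update rule itself is identical across the three schemes.

```python
# Index streams for the three data orderings commonly compared in such analyses.
import numpy as np

def index_stream(n, epochs, scheme, rng):
    if scheme == "with_replacement":                # classical SGD sampling
        return rng.integers(0, n, size=n * epochs)
    if scheme == "single_shuffle":                  # one permutation, reused every epoch
        perm = rng.permutation(n)
        return np.concatenate([perm for _ in range(epochs)])
    if scheme == "random_reshuffling":              # a fresh permutation each epoch
        return np.concatenate([rng.permutation(n) for _ in range(epochs)])
    raise ValueError(scheme)

rng = np.random.default_rng(0)
for scheme in ("with_replacement", "single_shuffle", "random_reshuffling"):
    idx = index_stream(n=8, epochs=2, scheme=scheme, rng=rng)
    print(scheme, idx.tolist())
```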