    Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

    When using Stochastic Gradient Descent (SGD) for training machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient. A recent work \cite{corgi} proposed an online shuffling algorithm called CorgiPile, which greatly improves the efficiency of data access at the cost of some performance loss that is particularly apparent for large datasets stored in homogeneous shards (e.g., video datasets). In this paper, we introduce a novel two-step partial data shuffling strategy for SGD which combines an offline iteration of the CorgiPile method with a subsequent online iteration. Our approach enjoys the best of both worlds: it performs similarly to SGD with random access (even for homogeneous data) without compromising the data access efficiency of CorgiPile. We provide a comprehensive theoretical analysis of the convergence properties of our method and demonstrate its practical advantages through experimental results. Comment: 19 pages, 5 figures
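    The two-step strategy above lends itself to a short illustration. The following is a minimal Python sketch under assumptions of our own: `read_block` and `write_block` are hypothetical storage callbacks (with `write_block` returning an id that `read_block` accepts), and the buffer and output block sizes are arbitrary. It is not the authors' exact Corgi^2 procedure, only the general shape of an offline buffered-shuffle pass that rewrites the data, followed by the same buffered shuffle run online during training.

```python
import random

def corgi2_two_step(block_ids, read_block, write_block, buffer_size, rng=random.Random(0)):
    """Two-step partial shuffle (sketch): an offline buffered-shuffle pass that
    rewrites the dataset as new blocks, followed by the same buffered shuffle
    run online while streaming examples to SGD."""

    def buffered_pass(ids, reader):
        order = list(ids)
        rng.shuffle(order)                      # read whole blocks in random order (sequential I/O per block)
        buf = []
        for bid in order:
            buf.extend(reader(bid))
            if len(buf) >= buffer_size:
                rng.shuffle(buf)                # shuffle within the bounded buffer
                while len(buf) > buffer_size // 2:
                    yield buf.pop()
        rng.shuffle(buf)                        # drain the remainder
        while buf:
            yield buf.pop()

    # Step 1 (offline): write the partially shuffled stream back to storage as new blocks.
    new_ids, chunk = [], []
    for example in buffered_pass(block_ids, read_block):
        chunk.append(example)
        if len(chunk) == buffer_size:           # arbitrary output block size for this sketch
            new_ids.append(write_block(chunk))
            chunk = []
    if chunk:
        new_ids.append(write_block(chunk))

    # Step 2 (online): stream the rewritten blocks through the same buffered shuffle during training.
    return buffered_pass(new_ids, read_block)
```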

    On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms

    The stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD, which matches mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than the previous literature. Nevertheless, we maintain the desired computational complexity that shuffling SGD has achieved in the general convex setting. Comment: The 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
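    As a concrete reference for the shuffling scheme analyzed here, the following minimal sketch shows epoch-wise shuffling SGD; the per-component gradient oracle `grad_i(x, i)` and the toy least-squares usage are illustrative assumptions, not part of the paper.

```python
import numpy as np

def shuffling_sgd(grad_i, x0, n, epochs, lr, rng=np.random.default_rng(0)):
    """Epoch-wise shuffling SGD: each epoch draws a fresh random permutation of
    the n component functions and takes one step per component gradient."""
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(n):            # without-replacement pass over the data
            x -= lr * grad_i(x, i)
    return x

# Toy usage: least squares with components f_i(x) = 0.5 * (a_i @ x - b_i)**2
A, b = np.random.randn(100, 5), np.random.randn(100)
x_hat = shuffling_sgd(lambda x, i: (A[i] @ x - b[i]) * A[i], np.zeros(5), n=100, epochs=50, lr=0.01)
```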

    Uniformly convex neural networks and non-stationary iterated network Tikhonov (iNETT) method

    We propose a non-stationary iterated network Tikhonov (iNETT) method for the solution of ill-posed inverse problems. The iNETT method employs deep neural networks to build a data-driven regularizer, and it avoids the difficult task of estimating the optimal regularization parameter. To achieve the theoretical convergence of iNETT, we introduce uniformly convex neural networks to build the data-driven regularizer. Rigorous theories and detailed algorithms are proposed for the construction of convex and uniformly convex neural networks. In particular, given a general neural network architecture, we prescribe sufficient conditions to achieve a trained neural network that is component-wise convex or uniformly convex; moreover, we provide concrete examples of realizing convexity and uniform convexity in the modern U-net architecture. With the tools of convex and uniformly convex neural networks, the iNETT algorithm is developed and a rigorous convergence analysis is provided. Lastly, we show applications of the iNETT algorithm in 2D computerized tomography, where numerical examples illustrate the efficacy of the proposed algorithm.
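    The outer iNETT iteration can be sketched schematically as below, under simplifying assumptions: `reg_grad` stands in for the gradient of the learned (uniformly convex) network regularizer, the inner Tikhonov problem is solved by plain gradient descent, and `alphas` is a user-chosen decreasing sequence. A real implementation would differentiate through the trained network rather than call a hand-written callback.

```python
import numpy as np

def inett(A, y, reg_grad, x0, alphas, inner_steps=200, lr=1e-3):
    """Non-stationary iterated Tikhonov loop (sketch): at outer step k,
    approximately minimize 0.5*||A x - y||^2 + alpha_k * R(x) by inner gradient
    descent, warm-started at the previous outer iterate, with a decreasing
    sequence of regularization weights alpha_k."""
    x = np.array(x0, dtype=float)
    for alpha in alphas:                        # e.g. a geometrically decaying sequence
        for _ in range(inner_steps):
            grad = A.T @ (A @ x - y) + alpha * reg_grad(x)
            x -= lr * grad
    return x

# Toy usage with a quadratic stand-in regularizer R(x) = 0.5*||x||^2, so reg_grad(x) = x
A = np.random.randn(30, 50); y = A @ np.random.randn(50)
x_rec = inett(A, y, reg_grad=lambda x: x, x0=np.zeros(50), alphas=[0.5 * 0.8**k for k in range(10)])
```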

    High Probability Guarantees for Random Reshuffling

    We consider the stochastic gradient method with random reshuffling (RR) for tackling smooth nonconvex optimization problems. RR finds broad applications in practice, notably in training neural networks. In this work, we first investigate the concentration property of RR's sampling procedure and establish a new high probability sample complexity guarantee for driving the gradient (without expectation) below ε, which effectively characterizes the efficiency of a single RR execution. Our derived complexity matches the best existing in-expectation one up to a logarithmic term, while imposing no additional assumptions and leaving RR's updating rule unchanged. Furthermore, by leveraging our derived high probability descent property and bound on the stochastic error, we propose a simple and computable stopping criterion for RR (denoted RR-sc). This criterion is guaranteed to be triggered after a finite number of iterations, and RR-sc then returns an iterate whose gradient is below ε with high probability. Moreover, building on the proposed stopping criterion, we design a perturbed random reshuffling method (p-RR) that involves an additional randomized perturbation procedure near stationary points. We show that p-RR provably escapes strict saddle points and efficiently returns a second-order stationary point with high probability, without making any sub-Gaussian tail-type assumptions on the stochastic gradient errors. Finally, we conduct numerical experiments on neural network training to support our theoretical findings. Comment: 21 pages, 3 figures
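    A sketch of random reshuffling with a computable stopping test in the spirit of RR-sc follows; the criterion below, based on the scaled per-epoch movement of the iterate, is a stand-in for the paper's criterion rather than its exact form, and the randomized perturbation step of p-RR is omitted.

```python
import numpy as np

def rr_sc(grad_i, x0, n, lr, eps, max_epochs, rng=np.random.default_rng(0)):
    """Random reshuffling with a computable stopping test (sketch): after each
    epoch, use the scaled per-epoch movement ||x_new - x_old|| / (n * lr) as a
    surrogate stationarity measure and stop once it drops below eps."""
    x = np.array(x0, dtype=float)
    for _ in range(max_epochs):
        x_prev = x.copy()
        for i in rng.permutation(n):            # one without-replacement epoch
            x -= lr * grad_i(x, i)
        if np.linalg.norm(x - x_prev) / (n * lr) <= eps:   # surrogate stopping criterion
            return x
    return x
```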

    Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with Arbitrary Data Orders

    Stochastic Gradient Descent (SGD) algorithms are widely used in optimizing neural networks, with Random Reshuffling (RR) and Single Shuffle (SS) being popular choices that cycle through a fresh random permutation each epoch or a single fixed permutation of the training data. However, the convergence properties of these algorithms in the non-convex case are not fully understood. Existing results suggest that, in realistic training scenarios where the number of epochs is smaller than the training set size, RR may perform worse than SGD. In this paper, we analyze a general SGD algorithm that allows for arbitrary data orderings and show improved convergence rates for non-convex functions. Specifically, our analysis reveals that SGD with random and single shuffling is always faster than, or at least as good as, classical SGD with replacement, regardless of the number of iterations. Overall, our study highlights the benefits of using SGD with random/single shuffling and provides new insights into its convergence properties for non-convex optimization.
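    To make the compared data orderings concrete, the sketch below covers with-replacement SGD, Random Reshuffling, and Single Shuffle in one driver; the gradient oracle `grad_i(x, i)` and the string flags are illustrative assumptions, not from the paper.

```python
import numpy as np

def sgd_with_order(grad_i, x0, n, epochs, lr, order="rr", rng=np.random.default_rng(0)):
    """One driver for the data-ordering variants discussed above:
    'iid' - classical SGD, indices sampled with replacement each step;
    'rr'  - Random Reshuffling, a fresh permutation every epoch;
    'ss'  - Single Shuffle, one permutation fixed for all epochs."""
    x = np.array(x0, dtype=float)
    fixed = rng.permutation(n)                  # used only by 'ss'
    for _ in range(epochs):
        if order == "iid":
            idx = rng.integers(0, n, size=n)    # with replacement
        elif order == "rr":
            idx = rng.permutation(n)
        else:                                   # 'ss'
            idx = fixed
        for i in idx:
            x -= lr * grad_i(x, i)
    return x
```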