    Minimizing Finite Sums with the Stochastic Average Gradient

    We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in general, and when the sum is strongly convex the convergence rate is improved from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for p < 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies. Comment: Revision from January 2015 submission. Major changes: updated literature follow and discussion of subsequent work, additional Lemma showing the validity of one of the formulas, somewhat simplified presentation of Lyapunov bound, included code needed for checking proofs rather than the polynomials generated by the code, added error regions to the numerical experiment.
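
    The "memory of previous gradient values" described above can be illustrated with a minimal sketch. It assumes a least-squares objective f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2, uniform sampling, zero-initialized gradient memory, and a constant step of 1/(16 L) with L = max_i ||a_i||^2. The function name, the test problem, and these parameter choices are illustrative assumptions, not taken from the abstract.

```python
import numpy as np


def sag_least_squares(A, b, step, n_iters, seed=0):
    """Sketch of a SAG-style loop for f(x) = (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    grad_memory = np.zeros((n, d))  # last gradient evaluated for each term
    grad_sum = np.zeros(d)          # running sum of all stored gradients
    for _ in range(n_iters):
        i = rng.integers(n)                    # pick one term uniformly
        g_new = (A[i] @ x - b[i]) * A[i]       # gradient of the i-th term only
        grad_sum += g_new - grad_memory[i]     # swap the old entry for the new one
        grad_memory[i] = g_new
        x -= step * grad_sum / n               # step along the averaged memory
    return x


# Small synthetic problem (assumed data, consistent system so x_true is the minimizer).
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
L_max = np.max(np.sum(A**2, axis=1))  # largest per-term Lipschitz constant
x_hat = sag_least_squares(A, b, step=1.0 / (16.0 * L_max), n_iters=50000)
print(np.linalg.norm(x_hat - x_true))
```

    Each iteration evaluates a single term's gradient, so the cost per step does not grow with n, while the update direction averages over all n stored gradients. The price is the O(n d) memory table in this generic form.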

    Recent Advances in Randomized Methods for Big Data Optimization

    In this thesis, we discuss and develop randomized algorithms for big data problems. In particular, we study finite-sum optimization with recently developed variance-reduction methods (Chapter 2), explore the efficiency of second-order information applied to both convex and non-convex finite-sum objectives (Chapter 3), and employ fast first-order methods in power system problems (Chapter 4). In Chapter 2, we propose two variance-reduced gradient algorithms, mS2GD and SARAH. mS2GD incorporates a mini-batching scheme to improve the theoretical complexity and practical performance of SVRG/S2GD; it aims to minimize a strongly convex function represented as the sum of an average of a large number of smooth convex functions and a simple non-smooth convex regularizer. SARAH, short for StochAstic Recursive grAdient algoritHm, uses a stochastic recursive gradient and targets minimizing the average of a large number of smooth functions in both the convex and non-convex cases. Both methods fall into the category of variance-reduction optimization and obtain a total complexity of O((n+κ)log(1/ε)) to achieve an ε-accurate solution for strongly convex objectives, while SARAH also maintains sub-linear convergence for non-convex problems. Moreover, SARAH has a practical variant, SARAH+, owing to the linear convergence of the expected stochastic gradients in its inner loops. In Chapter 3, we show that randomized batches can be combined with second-order information to improve convergence in both theory and practice, using an L-BFGS framework as a novel approach to finite-sum optimization problems. We provide theoretical analyses for both convex and non-convex objectives. We also propose LBFGS-F, a variant in which the Fisher information matrix is used instead of Hessian information, and prove it applicable to a distributed environment for the popular least-squares and cross-entropy losses. In Chapter 4, we develop fast randomized algorithms for solving polynomial optimization problems arising in alternating-current optimal power flow (ACOPF) applications in power systems. Traditional research on power system problems focuses on solvers using second-order methods, while no randomized algorithms have been developed. First, we propose a coordinate-descent algorithm as an online solver for time-varying optimization problems in power systems. We bound the difference between the current approximate optimal cost generated by our algorithm and the optimal cost for a relaxation using the most recent data from above by a function of the properties of the instance and the rate of change of the instance over time. Second, we focus on a steady-state problem in power systems and study means of switching from solving a convex relaxation to Newton's method working on a non-convex (augmented) Lagrangian of the problem.
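
    The stochastic recursive gradient that Chapter 2 attributes to SARAH can be sketched as follows. This is a minimal illustration under stated assumptions: a smooth least-squares objective without the regularizer, a constant step of 1/(2 L) with L = max_i ||a_i||^2, and the last inner iterate used as the next outer starting point (a simplification chosen here). The function name and test data are hypothetical and not from the thesis.

```python
import numpy as np


def sarah_least_squares(A, b, step, n_outer, n_inner, seed=0):
    """Sketch of a SARAH-style loop for f(w) = (1/n) * sum_i 0.5 * (a_i @ w - b_i)**2.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(n_outer):
        # Outer step: one full gradient pass resets the estimator.
        w_prev = w.copy()
        v = A.T @ (A @ w_prev - b) / n
        w = w_prev - step * v
        for _ in range(n_inner):
            i = rng.integers(n)
            g_cur = (A[i] @ w - b[i]) * A[i]        # term i at the current iterate
            g_old = (A[i] @ w_prev - b[i]) * A[i]   # term i at the previous iterate
            v = g_cur - g_old + v                   # recursive update of the estimator
            w_prev = w.copy()
            w = w - step * v
    return w


# Small synthetic problem (assumed data, consistent system so w_true is the minimizer).
rng = np.random.default_rng(2)
A = rng.standard_normal((50, 5))
w_true = rng.standard_normal(5)
b = A @ w_true
L_max = np.max(np.sum(A**2, axis=1))
w_hat = sarah_least_squares(A, b, step=1.0 / (2.0 * L_max), n_outer=30, n_inner=500)
print(np.linalg.norm(w_hat - w_true))
```

    Unlike SVRG-type estimators, which always compare against a fixed snapshot, the estimator here is updated from its own previous value, so it needs no per-term gradient table; the periodic full gradient keeps the accumulated bias in check.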

    International Conference on Continuous Optimization (ICCOPT) 2019 Conference Book

    The Sixth International Conference on Continuous Optimization took place on the campus of the Technical University of Berlin, August 3-8, 2019. The ICCOPT is a flagship conference of the Mathematical Optimization Society (MOS), organized every three years. ICCOPT 2019 was hosted by the Weierstrass Institute for Applied Analysis and Stochastics (WIAS) Berlin. It included a Summer School and a Conference with a series of plenary and semi-plenary talks, organized and contributed sessions, and poster sessions. This book comprises the full conference program. It contains, in particular, the scientific program both in survey style and in full detail, as well as information on the social program, the venue, special meetings, and more.

    Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent

    The emergence of big data in recent years, driven by vast societal digitalization and large-scale sensor deployment, has created significant interest in machine learning methods that enable automatic data analytics. In a majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure stochastic gradient descent (SGD) is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset in order to converge to a solution of sufficient quality. In order to cope with increasing data volumes, and to facilitate accelerated processing utilizing contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature because they scale better, requiring less coordination and consequently less waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due to the presence of both stale and inconsistent views of the shared state. In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism and develop tools and frameworks that facilitate improved convergence properties as well as further research and development. First, we focus on understanding the impact of staleness, and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on the convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant, MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we explore the impact of synchronization mechanisms, in particular consistency-preserving ones, and their overall effect on the convergence properties. To this end, we propose Leashed-SGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods together with established baselines, focusing on the prominent application of Deep Learning for image classification on the benchmark datasets MNIST and CIFAR, showing significant improvements in convergence time for Leashed-SGD and MindTheStep-AsyncSGD.
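
    The notion of staleness discussed above can be made concrete with a small sequential simulation. This sketch is not MindTheStep-AsyncSGD or Leashed-SGD: it runs in a single thread, draws the staleness tau uniformly at random, and uses the damping rule step / (1 + tau), which is a hypothetical placeholder chosen only to show where a staleness-adaptive step size would enter. All names and parameters are assumptions.

```python
from collections import deque

import numpy as np


def stale_sgd(A, b, base_step, n_iters, max_staleness, adaptive, seed=0):
    """Single-threaded simulation of SGD with stale gradients on least squares.

    Hypothetical helper: each gradient is evaluated at an iterate that is
    tau updates old, mimicking a worker that read the shared state earlier.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    # Recent iterates, newest last; history[-1 - tau] is tau updates old.
    history = deque([x.copy()], maxlen=max_staleness + 1)
    for _ in range(n_iters):
        max_tau = min(max_staleness, len(history) - 1)
        tau = int(rng.integers(max_tau + 1))       # staleness of this update
        x_stale = history[-1 - tau]
        i = rng.integers(n)
        g = (A[i] @ x_stale - b[i]) * A[i]         # gradient at the stale view
        # Placeholder adaptation rule (assumption): damp older gradients more.
        step = base_step / (1 + tau) if adaptive else base_step
        x = x - step * g
        history.append(x.copy())
    return x


# Small synthetic problem (assumed data, consistent system so x_true is the minimizer).
rng = np.random.default_rng(3)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
L_max = np.max(np.sum(A**2, axis=1))
base = 1.0 / (16.0 * L_max)
x_fixed = stale_sgd(A, b, base, n_iters=50000, max_staleness=4, adaptive=False)
x_adapt = stale_sgd(A, b, base, n_iters=50000, max_staleness=4, adaptive=True)
print(np.linalg.norm(x_fixed - x_true), np.linalg.norm(x_adapt - x_true))
```

    A real asynchronous implementation would observe tau from the execution itself, for example by counting the updates applied between a worker's read and its write, rather than sampling it.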
