    Towards stability and optimality in stochastic gradient descent

    Iterative procedures for parameter estimation based on stochastic gradient descent allow the estimation to scale to massive data sets. However, in both theory and practice, they suffer from numerical instability. Moreover, they are statistically inefficient as estimators of the true parameter value. To address these two issues, we propose a new iterative procedure termed averaged implicit SGD (AI-SGD). For statistical efficiency, AI-SGD employs averaging of the iterates, which achieves the optimal Cramér-Rao bound under strong convexity, i.e., it is an optimal unbiased estimator of the true parameter value. For numerical stability, AI-SGD employs an implicit update at each iteration, which is related to proximal operators in optimization. In practice, AI-SGD achieves competitive performance with other state-of-the-art procedures. Furthermore, it is more stable than averaging procedures that do not employ proximal updates, and is simple to implement as it requires fewer tunable hyperparameters than procedures that do employ proximal updates. Comment: Appears in Artificial Intelligence and Statistics, 201
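
    As a rough illustration of the two ingredients named above (an implicit update plus averaging of the iterates), the sketch below applies them to least-squares regression, where the implicit update happens to have a closed form. The function names, the learning-rate schedule `gamma0 * n ** (-decay)`, and the synthetic data generator are assumptions made for this example and are not taken from the paper.

```python
import random


def make_regression_data(n, true_theta, noise=0.1, seed=0):
    """Draw n pairs (x, y) with y = x . true_theta + Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_theta]
        y = sum(xi * ti for xi, ti in zip(x, true_theta)) + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


def ai_sgd_least_squares(data, dim, gamma0=1.0, decay=0.6):
    """Averaged implicit SGD for the loss 0.5 * (y - x . theta)**2.

    Returns the last iterate and the running average of all iterates.
    """
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for n, (x, y) in enumerate(data, start=1):
        gamma = gamma0 * n ** (-decay)  # assumed learning-rate schedule
        residual = y - sum(xi * ti for xi, ti in zip(x, theta))
        # The implicit update theta_n = theta_{n-1} + gamma * (y - x . theta_n) * x
        # has a closed form for this loss: the explicit step is shrunk by
        # 1 / (1 + gamma * |x|^2).
        step = gamma / (1.0 + gamma * sum(xi * xi for xi in x))
        theta = [ti + step * residual * xi for ti, xi in zip(theta, x)]
        # Running average of the iterates.
        theta_bar = [bi + (ti - bi) / n for bi, ti in zip(theta_bar, theta)]
    return theta, theta_bar
```

    For example, `ai_sgd_least_squares(make_regression_data(5000, [1.0, -2.0]), dim=2)` returns the last and the averaged iterate. Because the shrinkage factor bounds the effective step, the recursion stays finite even for a very large `gamma0`, which is the kind of stability the implicit update is meant to provide.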

    Fully Implicit Online Learning

    Regularized online learning is widely used in machine learning applications. In online learning, performing exact minimization (i.e., an implicit update) is known to benefit the numerical stability and the structure of the solution. In this paper we study a class of regularized online algorithms that linearize neither the loss function nor the regularizer, which we call \emph{fully implicit online learning} (FIOL). We show that for an arbitrary Bregman divergence, FIOL has $O(\sqrt{T})$ regret in the general convex setting and $O(\log T)$ regret in the strongly convex setting, and that the regret has a one-step improvement effect because it avoids the approximation error of linearization. We then propose efficient algorithms to solve the subproblem of FIOL. We show that even if the solution of the subproblem has no closed form, it can be solved with complexity comparable to that of the linearized online algorithms. Experiments validate the proposed approaches. Comment: 17 pages
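
    To make the contrast between a linearized and an implicit update concrete, here is a minimal sketch for the hinge loss with a Euclidean distance term and no regularizer. This is a deliberate simplification of the setting in the abstract (which covers general Bregman divergences and a non-linearized regularizer), and the function names are hypothetical. For this loss the exact minimizer of `hinge(v) + |v - w|^2 / (2 * eta)` is available in closed form.

```python
def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * <w, x>) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))


def linearized_step(w, x, y, eta):
    """Subgradient step: move by eta * y * x whenever the loss is positive."""
    if hinge_loss(w, x, y) <= 0.0:
        return list(w)
    return [wi + eta * y * xi for wi, xi in zip(w, x)]


def implicit_step(w, x, y, eta):
    """Exact minimizer over v of hinge_loss(v, x, y) + |v - w|^2 / (2 * eta)."""
    loss = hinge_loss(w, x, y)
    sq_norm = sum(xi * xi for xi in x)
    if loss <= 0.0 or sq_norm == 0.0:
        return list(w)
    # Move along y * x, but never past the point where the loss reaches zero.
    tau = min(eta, loss / sq_norm)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

    With `w = [0.0, 0.0]`, `x = [1.0, 0.0]`, `y = 1` and a large step `eta = 10.0`, the linearized step returns `[10.0, 0.0]` while the implicit step stops at `[1.0, 0.0]`, where the loss is already zero. For a small step such as `eta = 0.1` the two updates coincide.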

    A Variational Analysis of Stochastic Gradient Algorithms

    Stochastic Gradient Descent (SGD) is an important algorithm in machine learning. With constant learning rates, it is a stochastic process that, after an initial phase of convergence, generates samples from a stationary distribution. We show that SGD with constant rates can be effectively used as an approximate posterior inference algorithm for probabilistic modeling. Specifically, we show how to adjust the tuning parameters of SGD so as to match the resulting stationary distribution to the posterior. This analysis rests on interpreting SGD as a continuous-time stochastic process and then minimizing the Kullback-Leibler divergence between its stationary distribution and the target posterior. (This is in the spirit of variational inference.) In more detail, we model SGD as a multivariate Ornstein-Uhlenbeck process and then use properties of this process to derive the optimal parameters. This theoretical framework also connects SGD to modern scalable inference algorithms; we analyze the recently proposed stochastic gradient Fisher scoring under this perspective. We demonstrate that SGD with properly chosen constant rates gives a new way to optimize hyperparameters in probabilistic models. Comment: 8 pages, 3 figures
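
    The stationary behaviour described above can be checked numerically on a toy problem. The sketch below runs constant-rate SGD on a one-dimensional quadratic with additive Gaussian gradient noise and compares the empirical variance of the iterates with the exact stationary variance of that linear recursion. The quadratic objective, the noise model, and all function names are assumptions chosen for illustration; the paper's analysis is multivariate and continuous-time.

```python
import random


def constant_sgd_samples(lr, curvature=1.0, noise_sd=1.0, steps=200000,
                         burn_in=1000, seed=0):
    """Iterates of constant-rate SGD on 0.5 * curvature * theta**2.

    Each gradient is observed with additive N(0, noise_sd**2) noise.
    The first burn_in iterates (the transient phase) are discarded.
    """
    rng = random.Random(seed)
    theta = 5.0  # arbitrary starting point away from the optimum
    samples = []
    for n in range(steps):
        grad = curvature * theta + noise_sd * rng.gauss(0.0, 1.0)
        theta -= lr * grad
        if n >= burn_in:
            samples.append(theta)
    return samples


def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def stationary_variance(lr, curvature=1.0, noise_sd=1.0):
    """Exact stationary variance of the recursion in constant_sgd_samples.

    For small lr this approaches lr * noise_sd**2 / (2 * curvature), the
    variance of the corresponding Ornstein-Uhlenbeck process.
    """
    return lr * noise_sd ** 2 / (curvature * (2.0 - lr * curvature))
```

    In this toy setting the stationary variance grows roughly linearly with the learning rate, so the rate acts as a knob for the spread of the samples. This gives some intuition for why tuning it can bring the stationary distribution closer to a target posterior.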

    Convergence diagnostics for stochastic gradient descent with constant step size

    Many iterative procedures in stochastic optimization exhibit a transient phase followed by a stationary phase. During the transient phase the procedure converges towards a region of interest, and during the stationary phase the procedure oscillates in that region, commonly around a single point. In this paper, we develop a statistical diagnostic test to detect this phase transition in the context of stochastic gradient descent with a constant learning rate. We present theory and experiments suggesting that the region where the proposed diagnostic is activated coincides with the convergence region. For a class of loss functions, we derive a closed-form solution describing this region. Finally, we suggest an application that speeds up the convergence of stochastic gradient descent by halving the learning rate each time stationarity is detected. This leads to a new variant of stochastic gradient descent, which in many settings is comparable to the state of the art. Comment: Accepted to Artificial Intelligence and Statistics, 201
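
    One way to sketch the halving idea is with a diagnostic in the spirit of Pflug's procedure: keep a running sum of inner products of successive stochastic gradients, and declare stationarity once the sum turns negative. Whether this matches the exact test used in the paper is an assumption, as are the quadratic objective, the fixed `burn_in`, and the function name below.

```python
import random


def sgd_with_halving(target, lr=0.1, steps=20000, burn_in=100,
                     noise_sd=1.0, seed=0):
    """Constant-rate SGD on 0.5 * |theta - target|^2 with noisy gradients.

    The learning rate is halved whenever the running sum of inner products
    of successive gradients becomes negative.
    Returns (theta, final learning rate, number of halvings).
    """
    rng = random.Random(seed)
    theta = [t + 10.0 for t in target]  # hypothetical far-away start

    def noisy_grad(point):
        return [p - t + noise_sd * rng.gauss(0.0, 1.0)
                for p, t in zip(point, target)]

    running_sum = 0.0
    since_reset = 0
    halvings = 0
    prev_grad = noisy_grad(theta)
    for _ in range(steps):
        theta = [ti - lr * gi for ti, gi in zip(theta, prev_grad)]
        grad = noisy_grad(theta)
        since_reset += 1
        if since_reset > burn_in:
            # In the transient phase successive gradients tend to point the
            # same way (positive products); at stationarity they tend to
            # oppose each other (negative products on average).
            running_sum += sum(a * b for a, b in zip(prev_grad, grad))
            if running_sum < 0.0:
                lr /= 2.0
                halvings += 1
                running_sum = 0.0
                since_reset = 0
        prev_grad = grad
    return theta, lr, halvings
```

    This version checks the sum after every step, so it is eager and can shrink the learning rate faster than is useful. A more careful implementation would likely guard against early false positives, for example by checking less often or by bounding the learning rate from below.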

    Stochastic proximal splitting algorithm for composite minimization

    Supported by recent contributions from multiple branches, first-order splitting algorithms have become central to structured nonsmooth optimization. In large-scale or noisy contexts, when only stochastic information on the smooth part of the objective function is available, the extension of proximal gradient schemes to stochastic oracles relies on the proximal tractability of the nonsmooth component and has been deeply analyzed in the literature. However, gaps remain, as illustrated by composite models in which the nonsmooth term is no longer proximally tractable. In this note we tackle composite optimization problems in which only stochastic information on both the smooth and the nonsmooth components is assumed to be accessible, using a stochastic proximal first-order scheme with stochastic proximal updates. We provide an $\mathcal{O}\left(\frac{1}{k}\right)$ iteration complexity (in expectation of the squared distance to the optimal set) under a strong convexity assumption on the objective function. Empirical behavior is illustrated by numerical tests on parametric sparse representation models.

    Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning Models

    Stochastic optimization lies at the core of most statistical learning models. Recent developments in stochastic algorithmic tools have focused significantly on proximal gradient iterations, in order to find an efficient approach for nonsmooth (composite) population risk functions. The complexity of finding optimal predictors by minimizing regularized risk is largely understood for simple regularizations such as $\ell_1/\ell_2$ norms. However, more complex properties desired for the predictor necessitate highly difficult regularizers, such as those used in grouped lasso or graph trend filtering. In this chapter we develop and analyze minibatch variants of the stochastic proximal gradient algorithm for general composite objective functions with stochastic nonsmooth components. We provide the iteration complexity for constant and variable stepsize policies, obtaining that, for minibatch size $N$, after $\mathcal{O}\left(\frac{1}{N\epsilon}\right)$ iterations, $\epsilon$-suboptimality is attained in expected quadratic distance to the optimal solution. Numerical tests on $\ell_2$-regularized SVMs and parametric sparse representation problems confirm the theoretical behaviour and surpass the performance of minibatch SGD.
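
    For orientation, the sketch below shows a minibatch stochastic proximal gradient iteration in the simplest case mentioned above: a least-squares risk with a deterministic $\ell_1$ regularizer, whose proximal operator is soft-thresholding. It does not cover the stochastic nonsmooth components that the chapter analyzes. The stepsize rule, the default parameters, and the sampler `sparse_linear_sample` are assumptions made for this example.

```python
import random


def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.| applied to a scalar."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def minibatch_prox_sgd(sample, dim, lam=0.05, batch_size=8, iters=3000, seed=0):
    """Minibatch stochastic proximal gradient for
    E[0.5 * (x . theta - y)^2] + lam * |theta|_1.

    sample(rng) must return one pair (x, y).
    """
    rng = random.Random(seed)
    theta = [0.0] * dim
    for k in range(iters):
        step = 1.0 / (10.0 + k)  # assumed decreasing stepsize
        grad = [0.0] * dim
        # Average the least-squares gradient over the minibatch.
        for _ in range(batch_size):
            x, y = sample(rng)
            err = sum(xi * ti for xi, ti in zip(x, theta)) - y
            for j in range(dim):
                grad[j] += err * x[j] / batch_size
        # Gradient step on the smooth part, then the prox of step * lam * |.|_1.
        theta = [soft_threshold(t - step * g, step * lam)
                 for t, g in zip(theta, grad)]
    return theta


def sparse_linear_sample(rng, true_theta=(2.0, 0.0, -1.0), noise=0.1):
    """Hypothetical data source: y = x . true_theta + Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in true_theta]
    y = sum(xi * ti for xi, ti in zip(x, true_theta)) + rng.gauss(0.0, noise)
    return x, y
```

    Calling `minibatch_prox_sgd(sparse_linear_sample, dim=3)` should return an estimate close to the sparse coefficient vector, with the nonzero entries shrunk slightly toward zero by the $\ell_1$ penalty. A larger `batch_size` reduces the variance of each gradient estimate at the cost of more samples per iteration.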

    Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization

    A very popular approach for solving stochastic optimization problems is the stochastic gradient descent method (SGD). Although the SGD iteration is computationally cheap and the practical performance of this method may be satisfactory under certain circumstances, there is recent evidence of its convergence difficulties and instability for inappropriate parameter choices. To avoid these drawbacks naturally introduced by the SGD scheme, stochastic proximal point algorithms have recently been considered in the literature. We introduce a new variant of the stochastic proximal point method (SPP) for solving stochastic convex optimization problems subject to an (in)finite intersection of constraints satisfying a linear regularity type condition. For the newly introduced SPP scheme we prove new nonasymptotic convergence results. In particular, for convex and Lipschitz continuous objective functions, we prove nonasymptotic estimates for the rate of convergence in terms of the expected value function gap of order $\mathcal{O}(1/k^{1/2})$, where $k$ is the iteration counter. We also derive better nonasymptotic bounds for the rate of convergence in terms of the expected quadratic distance from the iterates to the optimal solution for smooth strongly convex objective functions, which in the best case is of order $\mathcal{O}(1/k)$. Since these convergence rates can be attained by our SPP algorithm only under some natural restrictions on the stepsize, we also introduce a restarting variant of the SPP method that overcomes these difficulties and derive the corresponding nonasymptotic convergence rates. Numerical evidence supports the effectiveness of our methods in real-world problems.

    Sub-linear convergence of a stochastic proximal iteration method in Hilbert space

    We consider a stochastic version of the proximal point algorithm for optimization problems posed on a Hilbert space. A typical application of this is supervised learning. While the method is not new, it has not been extensively analyzed in this form. Indeed, most related results are confined to the finite-dimensional setting, where error bounds could depend on the dimension of the space. On the other hand, the few existing results in the infinite-dimensional setting only prove very weak types of convergence, owing to weak assumptions on the problem. In particular, there are no results that show convergence with a rate. In this article, we bridge these two worlds by assuming more regularity of the optimization problem, which allows us to prove convergence with an (optimal) sub-linear rate also in an infinite-dimensional setting. We illustrate these results by discretizing a concrete infinite-dimensional classification problem with varying degrees of accuracy. Comment: Corrected mistake in metadata

    Stochastic Gradient Descent as Approximate Bayesian Inference

    Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler. Comment: 35 pages, published version (JMLR 2017)

    General convergence analysis of stochastic first order methods for composite optimization

    In this paper we consider stochastic composite convex optimization problems with the objective function satisfying a stochastic bounded gradient condition, with or without a quadratic functional growth property. These models include the most well-known classes of objective functions analyzed in the literature: non-smooth Lipschitz functions and compositions of a (potentially) non-smooth function and a smooth function, with or without strong convexity. Based on the flexibility offered by our optimization model, we consider several variants of stochastic first order methods, such as the stochastic proximal gradient and the stochastic proximal point algorithms. Usually, the convergence theory for these methods has been derived for simple stochastic optimization models satisfying restrictive assumptions; the rates are in general sublinear and hold only for specific decreasing stepsizes. Hence, we analyze the convergence rates of stochastic first order methods with constant or variable stepsize under general assumptions covering a large class of objective functions. For constant stepsize we show that these methods can achieve a linear convergence rate up to a constant proportional to the stepsize and, under some strong stochastic bounded gradient condition, even pure linear convergence. Moreover, when a variable stepsize is chosen we derive sublinear convergence rates for these stochastic first order methods. Finally, the stochastic gradient mapping and the Moreau smoothing mapping introduced in the present paper lead to simple and intuitive proofs. Comment: The results of this paper have been obtained by the author since 2017 and presented at several international conferences: e.g., Conference on Recent Advances in Artificial Intelligence, June 2017; KAUST Research Workshop on Optimization and Big Data, February 201