16 research outputs found
Towards stability and optimality in stochastic gradient descent
Iterative procedures for parameter estimation based on stochastic gradient
descent allow the estimation to scale to massive data sets. However, in both
theory and practice, they suffer from numerical instability. Moreover, they are
statistically inefficient as estimators of the true parameter value. To address
these two issues, we propose a new iterative procedure termed averaged implicit
SGD (AI-SGD). For statistical efficiency, AI-SGD employs averaging of the
iterates, which achieves the optimal Cram\'{e}r-Rao bound under strong
convexity, i.e., it is an optimal unbiased estimator of the true parameter
value. For numerical stability, AI-SGD employs an implicit update at each
iteration, which is related to proximal operators in optimization. In practice,
AI-SGD achieves competitive performance with other state-of-the-art procedures.
Furthermore, it is more stable than averaging procedures that do not employ
proximal updates, and is simple to implement as it requires fewer tunable
hyperparameters than procedures that do employ proximal updates.
Comment: Appears in Artificial Intelligence and Statistics, 201
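The two ingredients above can be sketched for a least-squares model, where the implicit update has a closed form; the `ai_sgd` name, the step-size schedule, and the data model are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def ai_sgd(X, y, lr=0.5):
    """Averaged implicit SGD for least squares (illustrative sketch).

    Each implicit update solves theta_n = theta_{n-1} + g*(y_n - x_n'theta_n)*x_n,
    which for linear models has the closed form used below; the returned
    estimate is the running (Polyak-Ruppert) average of the iterates.
    """
    n, d = X.shape
    theta = np.zeros(d)
    avg = np.zeros(d)
    for i in range(n):
        x, yi = X[i], y[i]
        g = lr / np.sqrt(i + 1)                 # slowly decaying step size
        resid = yi - x @ theta                  # residual at the old iterate
        # implicit update in closed form: the step is shrunk by 1/(1 + g||x||^2),
        # which keeps the iteration numerically stable for any g
        theta = theta + g / (1.0 + g * (x @ x)) * resid * x
        avg += (theta - avg) / (i + 1)          # running average of iterates
    return avg
```

The shrinkage factor 1/(1 + g||x||^2) is what distinguishes the implicit step from plain SGD: a large step size is automatically damped instead of causing divergence.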
Fully Implicit Online Learning
Regularized online learning is widely used in machine learning applications.
In online learning, performing exact minimization (i.e., an implicit update) is
known to benefit the numerical stability and the structure of the solution. In
this paper we study a class of regularized online algorithms without
linearizing the loss function or the regularizer, which we call \emph{fully
implicit online learning} (FIOL). We show that, for an arbitrary Bregman
divergence, FIOL attains $O(\sqrt{T})$ regret in the general convex setting and
$O(\log T)$ regret in the strongly convex setting, and that the regret enjoys a
one-step improvement effect because it avoids the approximation error of linearization.
Then we propose efficient algorithms to solve the subproblem of FIOL. We show
that even if the solution of the subproblem has no closed form, it can be
solved with complexity comparable to that of linearized online algorithms.
Experiments validate the proposed approaches.
Comment: 17 pages
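To illustrate how the implicit subproblem can stay cheap even without a closed form, the sketch below performs one fully implicit step for the logistic loss by reducing it to a one-dimensional root-finding problem; the function names and the bisection tolerance are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def implicit_logistic_step(theta, x, y, g, tol=1e-10):
    """One fully implicit step for logistic loss with label y in {-1, +1}
    (sketch). The update theta_new = theta - g * grad_loss(theta_new)
    reduces to finding the root of a monotone scalar function of the step
    scale s, solved here by bisection, so the cost stays comparable to a
    linearized update.
    """
    u = x @ theta
    xx = x @ x
    lo, hi = -abs(g), abs(g)        # the root s* always lies in this bracket
    while hi - lo > tol:
        s = 0.5 * (lo + hi)
        # f(s) = s - g*y*sigmoid(-y*(u + s*xx)) is strictly increasing in s
        if s - g * y * sigmoid(-y * (u + s * xx)) > 0:
            hi = s
        else:
            lo = s
    return theta + 0.5 * (lo + hi) * x
```

Because the bracket has fixed width 2|g|, the bisection needs only O(log(1/tol)) sigmoid evaluations per step.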
A Variational Analysis of Stochastic Gradient Algorithms
Stochastic Gradient Descent (SGD) is an important algorithm in machine
learning. With constant learning rates, it is a stochastic process that, after
an initial phase of convergence, generates samples from a stationary
distribution. We show that SGD with constant rates can be effectively used as
an approximate posterior inference algorithm for probabilistic modeling.
Specifically, we show how to adjust the tuning parameters of SGD so as to
match the resulting stationary distribution to the posterior. This analysis
rests on interpreting SGD as a continuous-time stochastic process and then
minimizing the Kullback-Leibler divergence between its stationary distribution
and the target posterior. (This is in the spirit of variational inference.) In
more detail, we model SGD as a multivariate Ornstein-Uhlenbeck process and then
use properties of this process to derive the optimal parameters. This
theoretical framework also connects SGD to modern scalable inference
algorithms; we analyze the recently proposed stochastic gradient Fisher scoring
under this perspective. We demonstrate that SGD with properly chosen constant
rates gives a new way to optimize hyperparameters in probabilistic models.
Comment: 8 pages, 3 figures
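The Ornstein-Uhlenbeck view can be checked numerically on a one-dimensional quadratic loss: for a small constant rate eps, the stationary variance of the iterates should be close to eps * sigma^2 / (2A). The simulation below is an illustrative sketch; the problem, constants, and names are assumptions, not the paper's setup.

```python
import numpy as np

def sgd_stationary_variance(A, sigma, eps, n_iter=200000, burn=20000, seed=0):
    """Constant-rate SGD on the 1-D quadratic loss 0.5*A*x^2 with Gaussian
    gradient noise; returns the empirical variance of the iterates after a
    burn-in. Viewing the iterates as a discretized Ornstein-Uhlenbeck
    process predicts Var[x] ~ eps * sigma^2 / (2*A) for small eps.
    """
    rng = np.random.default_rng(seed)
    noise = sigma * rng.normal(size=n_iter)
    x = 0.0
    samples = []
    for t in range(n_iter):
        grad = A * x + noise[t]      # noisy gradient of the quadratic
        x -= eps * grad
        if t >= burn:
            samples.append(x)
    return float(np.var(samples))
```

With A = 1, sigma = 1, eps = 0.01, the OU prediction is a stationary variance of about 0.005, which the simulation reproduces.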
Convergence diagnostics for stochastic gradient descent with constant step size
Many iterative procedures in stochastic optimization exhibit a transient
phase followed by a stationary phase. During the transient phase the procedure
converges towards a region of interest, and during the stationary phase the
procedure oscillates in that region, commonly around a single point. In this
paper, we develop a statistical diagnostic test to detect such phase transition
in the context of stochastic gradient descent with constant learning rate. We
present theory and experiments suggesting that the region where the proposed
diagnostic is activated coincides with the convergence region. For a class of
loss functions, we derive a closed-form solution describing such region.
Finally, we suggest an application to speed up convergence of stochastic
gradient descent by halving the learning rate each time stationarity is
detected. This leads to a new variant of stochastic gradient descent, which in
many settings is comparable to the state of the art.
Comment: Accepted to Artificial Intelligence and Statistics, 201
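A minimal sketch of this idea for least-squares SGD: a Pflug-type running sum of inner products between successive stochastic gradients serves as the diagnostic (it drifts negative once the iterates oscillate around a point), and the learning rate is halved each time it fires. The names, burn-in, and schedule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def sgd_with_halving(X, y, lr=0.1, n_epochs=20, burn=100, seed=0):
    """SGD for least squares that halves the learning rate whenever a
    Pflug-type stationarity diagnostic is activated (sketch).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    prev_grad = None
    stat, count = 0.0, 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            grad = (X[i] @ theta - y[i]) * X[i]   # stochastic gradient
            theta -= lr * grad
            if prev_grad is not None:
                stat += prev_grad @ grad          # running inner-product sum
                count += 1
            prev_grad = grad
            if count > burn and stat < 0:         # stationarity detected
                lr *= 0.5
                stat, count, prev_grad = 0.0, 0, None
    return theta
```

During the transient phase successive gradients point in similar directions, keeping the sum positive; in the stationary phase they tend to point in opposing directions, driving it negative.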
Stochastic proximal splitting algorithm for composite minimization
Supported by recent contributions in multiple branches, first-order
splitting algorithms have become central to structured nonsmooth optimization. In
large-scale or noisy contexts, when only stochastic information on the
smooth part of the objective function is available, the extension of proximal
gradient schemes to stochastic oracles is based on proximal tractability of the
nonsmooth component, and it has been analyzed in depth in the literature. However,
gaps remain, illustrated by composite models whose nonsmooth term is no
longer proximally tractable. In this note we tackle composite optimization
problems, where the access only to stochastic information on both smooth and
nonsmooth components is assumed, using a stochastic proximal first-order scheme
with stochastic proximal updates. We provide the iteration complexity (in
expectation of the squared distance to the optimal set) under a strong convexity
assumption on the objective function. Empirical behavior is illustrated by
numerical tests on parametric sparse representation models.
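The idea of applying stochastic proximal updates to sampled pieces of an intractable nonsmooth term can be sketched on least absolute deviations: the full term, an expectation of |a_i'x - b_i|, has no tractable prox, but each sampled piece does. The model, stepsizes, and names below are illustrative assumptions, not the note's exact scheme.

```python
import numpy as np

def prox_abs_affine(x, a, b, gamma):
    """Closed-form prox of gamma * |a'x - b|: a capped projection along a."""
    r = a @ x - b
    aa = a @ a
    t = np.sign(r) * min(abs(r) / aa, gamma)
    return x - t * a

def lad_spp(A, b, gamma0=1.0, n_steps=50000, seed=0):
    """Stochastic proximal updates for least absolute deviations (sketch):
    each step samples one nonsmooth piece |a_i'x - b_i| and applies its
    closed-form proximal operator with a decaying stepsize.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(n_steps):
        i = rng.integers(n)
        x = prox_abs_affine(x, A[i], b[i], gamma0 / np.sqrt(k + 1))
    return x
```

Each prox step is either an exact projection onto the hyperplane a_i'x = b_i (when the residual is small) or a step of length gamma along a_i (when it is large), which is what keeps the iteration stable.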
Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning Models
Stochastic optimization lies at the core of most statistical learning models.
The recent great development of stochastic algorithmic tools focused
significantly onto proximal gradient iterations, in order to find an efficient
approach for nonsmooth (composite) population risk functions. The complexity of
finding optimal predictors by minimizing regularized risk is largely understood
for simple regularizations such as standard norms. However, more complex
properties desired for the predictor necessitate more difficult regularizers,
as used in the grouped lasso or graph trend filtering. In this chapter we develop
and analyze minibatch variants of stochastic proximal gradient algorithm for
general composite objective functions with stochastic nonsmooth components. We
provide the iteration complexity for constant and variable stepsize policies,
obtaining that, for minibatch size $N$, after
$\mathcal{O}\left(\frac{1}{N\epsilon}\right)$ iterations an $\epsilon$-suboptimality
is attained in expected quadratic distance to the optimal solution. Numerical
tests on regularized SVMs and parametric sparse representation
problems confirm the theoretical behaviour and surpass minibatch SGD
performance.
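A minimal sketch of the minibatch pattern for a least-squares model, where each sampled term has a closed-form proximal operator and the new iterate averages the proxes over the minibatch; the stepsize policy and names are illustrative assumptions, not the chapter's exact algorithm.

```python
import numpy as np

def prox_sq(x, a, b, gamma):
    """Closed-form prox of gamma * 0.5 * (a'x - b)^2 (one least-squares term)."""
    return x - gamma * (a @ x - b) / (1.0 + gamma * (a @ a)) * a

def minibatch_spp(A, b, batch=10, gamma0=1.0, n_steps=5000, seed=0):
    """Minibatch stochastic proximal point for least squares (sketch):
    each iterate is the average of the closed-form proxes of the sampled
    terms, evaluated at the current point, with a decaying stepsize.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(n_steps):
        idx = rng.integers(n, size=batch)
        gamma = gamma0 / np.sqrt(k + 1)          # variable stepsize policy
        x = np.mean([prox_sq(x, A[i], b[i], gamma) for i in idx], axis=0)
    return x
```

Averaging the per-sample proxes keeps each update as cheap as a minibatch gradient step while retaining the stability of proximal updates.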
Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization
A very popular approach for solving stochastic optimization problems is the
stochastic gradient descent method (SGD). Although the SGD iteration is
computationally cheap and the practical performance of this method may be
satisfactory under certain circumstances, there is recent evidence of its
convergence difficulties and instability for inappropriate parameter choices.
To avoid these drawbacks naturally introduced by the SGD scheme, the stochastic
proximal point algorithms have been recently considered in the literature. We
introduce a new variant of the stochastic proximal point method (SPP) for
solving stochastic convex optimization problems subject to (in)finite
intersection of constraints satisfying a linear regularity type condition. For
the newly introduced SPP scheme we prove new nonasymptotic convergence results.
In particular, for convex and Lipschitz continuous objective functions, we
prove nonasymptotic estimates for the rate of convergence in terms of the
expected value function gap of order $\mathcal{O}\left(\frac{1}{\sqrt{k}}\right)$,
where $k$ is the iteration counter. We also derive better nonasymptotic bounds
for the rate of convergence in terms of the expected quadratic distance from the
iterates to the optimal solution for smooth strongly convex objective functions,
which in the best case is of order $\mathcal{O}\left(\frac{1}{k}\right)$. Since
these convergence rates can be
attained by our SPP algorithm only under some natural restrictions on the
stepsize, we also introduce a restarting variant of SPP method that overcomes
these difficulties and derive the corresponding nonasymptotic convergence
rates. Numerical evidence supports the effectiveness of our methods in
real-world problems.
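The constraint-sampling idea can be sketched on a toy problem: minimize 0.5*||x - p||^2 subject to x >= 0, with the objective sampled coordinate-wise and the feasible set written as an intersection of half-spaces, one of which is projected onto per iteration. Everything below (model, stepsizes, names) is an illustrative assumption, not the paper's SPP variant.

```python
import numpy as np

def constrained_spp(p, n_steps=20000, seed=0):
    """Stochastic proximal point for min 0.5*||x - p||^2 s.t. x >= 0 (sketch).
    The objective is sampled coordinate-wise (each piece has a closed-form
    prox) and the feasible set is the intersection of half-spaces
    {x_j >= 0}; each iteration projects onto one randomly chosen half-space.
    """
    rng = np.random.default_rng(seed)
    d = len(p)
    x = np.zeros(d)
    for k in range(n_steps):
        gamma = 2.0 / (k + 2)                    # decaying stepsize
        i = rng.integers(d)
        # prox of gamma * (d/2) * (x_i - p_i)^2 changes only coordinate i
        x[i] = (x[i] + gamma * d * p[i]) / (1.0 + gamma * d)
        j = rng.integers(d)
        x[j] = max(x[j], 0.0)                    # project onto {x_j >= 0}
    return x
```

Only one randomly drawn constraint is enforced per step, yet the iterates still approach the constrained minimizer max(p, 0), which is the point of the (in)finite-intersection setting.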
Sub-linear convergence of a stochastic proximal iteration method in Hilbert space
We consider a stochastic version of the proximal point algorithm for
optimization problems posed on a Hilbert space. A typical application of this
is supervised learning. While the method is not new, it has not been
extensively analyzed in this form. Indeed, most related results are confined to
the finite-dimensional setting, where error bounds could depend on the
dimension of the space. On the other hand, the few existing results in the
infinite-dimensional setting only prove very weak types of convergence, owing
to weak assumptions on the problem. In particular, there are no results that
show convergence with a rate. In this article, we bridge these two worlds by
assuming more regularity of the optimization problem, which allows us to prove
convergence with an (optimal) sub-linear rate also in an infinite-dimensional
setting. We illustrate these results by discretizing a concrete
infinite-dimensional classification problem with varying degrees of accuracy.
Comment: Corrected mistake in metadata
Stochastic Gradient Descent as Approximate Bayesian Inference
Stochastic Gradient Descent with a constant learning rate (constant SGD)
simulates a Markov chain with a stationary distribution. With this perspective,
we derive several new results. (1) We show that constant SGD can be used as an
approximate Bayesian posterior inference algorithm. Specifically, we show how
to adjust the tuning parameters of constant SGD to best match the stationary
distribution to a posterior, minimizing the Kullback-Leibler divergence between
these two distributions. (2) We demonstrate that constant SGD gives rise to a
new variational EM algorithm that optimizes hyperparameters in complex
probabilistic models. (3) We also propose SGD with momentum for sampling and
show how to adjust the damping coefficient accordingly. (4) We analyze MCMC
algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we
quantify the approximation errors due to finite learning rates. Finally (5), we
use the stochastic process perspective to give a short proof of why Polyak
averaging is optimal. Based on this idea, we propose a scalable approximate
MCMC algorithm, the Averaged Stochastic Gradient Sampler.
Comment: 35 pages, published version (JMLR 2017)
General convergence analysis of stochastic first order methods for composite optimization
In this paper we consider stochastic composite convex optimization problems
with the objective function satisfying a stochastic bounded gradient condition,
with or without a quadratic functional growth property. These models include
the most well-known classes of objective functions analyzed in the literature:
non-smooth Lipschitz functions and composition of a (potentially) non-smooth
function and a smooth function, with or without strong convexity. Based on the
flexibility offered by our optimization model we consider several variants of
stochastic first order methods, such as the stochastic proximal gradient and
the stochastic proximal point algorithms. Usually, the convergence theory for
these methods has been derived for simple stochastic optimization models
satisfying restrictive assumptions; the rates are in general sublinear and hold
only for specific decreasing stepsizes. Hence, we analyze the convergence rates
of stochastic first order methods with constant or variable stepsize under
general assumptions covering a large class of objective functions. For constant
stepsize we show that these methods can achieve a linear convergence rate up to
a constant proportional to the stepsize and, under a strong stochastic bounded
gradient condition, even pure linear convergence. Moreover, when a variable
stepsize is chosen we derive sublinear convergence rates for these stochastic
first order methods. Finally, the stochastic gradient mapping and the Moreau
smoothing mapping introduced in the present paper lead to simple and intuitive
proofs.
Comment: The results of this paper have been obtained by the author since 2017
and presented at several international conferences: e.g., Conference on
Recent Advances in Artificial Intelligence, June 2017; KAUST Research
Workshop on Optimization and Big Data, February 201
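The constant-stepsize behaviour (fast convergence to a region whose size scales with the stepsize) can be observed on a small lasso problem with a stochastic proximal gradient method; the sketch below measures the variance of the late iterates for a given stepsize, so halving the stepsize should roughly halve it. The model and all parameters are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

def prox_l1(v, t):
    """Soft-thresholding: proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def late_iterate_variance(gamma, lam=0.05, n=500, d=5,
                          n_steps=60000, tail=20000, seed=0):
    """Stochastic proximal gradient on a lasso objective with constant
    stepsize gamma; returns the mean per-coordinate variance of the last
    `tail` iterates, a proxy for the size of the region the iterates
    converge to.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    theta_true = rng.normal(size=d)
    y = X @ theta_true + 0.5 * rng.normal(size=n)
    x = np.zeros(d)
    hist = []
    for k in range(n_steps):
        i = rng.integers(n)
        grad = (X[i] @ x - y[i]) * X[i]              # stochastic gradient (smooth part)
        x = prox_l1(x - gamma * grad, gamma * lam)   # proximal (soft-threshold) step
        if k >= n_steps - tail:
            hist.append(x.copy())
    return float(np.mean(np.var(np.array(hist), axis=0)))
```

Comparing two runs with stepsizes gamma and gamma/2 shows the smaller stepsize producing a visibly tighter stationary region, consistent with linear convergence up to a stepsize-proportional constant.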