On the linear convergence of the stochastic gradient method with constant step-size
The strong growth condition (SGC) is known to be a sufficient condition for
linear convergence of the stochastic gradient method using a constant step-size
(SGM-CS). In this paper, we provide a necessary condition for the linear
convergence of SGM-CS that is weaker than the SGC. Moreover, when this
necessary condition is violated up to an additive perturbation, we show that
both the projected stochastic gradient method using a constant step-size
(PSGM-CS) and the proximal stochastic gradient method exhibit linear
convergence to a noise-dominated region, whose distance to the optimal
solution is proportional to the size of the perturbation.
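As a minimal illustration of the setting (not the paper's analysis), the sketch below runs a constant step-size stochastic gradient method on a toy interpolation problem, where every term shares the same minimizer and the strong growth condition holds, so the iterates converge linearly. The function name `sgm_cs` and the quadratic terms are illustrative assumptions.

```python
import numpy as np

def sgm_cs(grads, x0, step, iters, rng):
    """Stochastic gradient method with a constant step-size (SGM-CS):
    sample one term uniformly and step along its gradient."""
    x = x0.copy()
    for _ in range(iters):
        i = rng.integers(len(grads))
        x = x - step * grads[i](x)
    return x

# Toy interpolation problem: f_i(x) = 0.5 * a_i * ||x||^2 with a_i > 0,
# so every term is minimized at x* = 0 and the SGC holds trivially.
rng = np.random.default_rng(0)
a = [1.0, 2.0, 3.0]
grads = [lambda x, ai=ai: ai * x for ai in a]
x = sgm_cs(grads, np.ones(2), step=0.1, iters=200, rng=rng)
print(np.linalg.norm(x))  # decays linearly toward the shared minimizer
```

With `step=0.1`, each update contracts the iterate by a factor of at most 0.9, so the distance to the minimizer shrinks geometrically, which is the linear convergence the abstract refers to.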
Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization
We propose a novel Stochastic Frank-Wolfe (a.k.a. conditional gradient)
algorithm for constrained smooth finite-sum minimization with a generalized
linear prediction/structure. This class of problems includes empirical risk
minimization with sparse, low-rank, or other structured constraints. The
proposed method is simple to implement, does not require step-size tuning, and
has a constant per-iteration cost that is independent of the dataset size.
Furthermore, as a byproduct of the method we obtain a stochastic estimator of
the Frank-Wolfe gap that can be used as a stopping criterion. Depending on the
setting, the proposed method matches or improves on the best computational
guarantees for Stochastic Frank-Wolfe algorithms. Benchmarks on several
datasets highlight different regimes in which the proposed method exhibits a
faster empirical convergence than related methods. Finally, we provide an
implementation of all considered methods in an open-source package.

Comment: To appear in the Proceedings of the 37th International Conference on
Machine Learning, 2020. Main text: 9 pages, 1 figure. Fixes previously found
errors.
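To make the conditional-gradient step and the role of the Frank-Wolfe gap concrete, here is a minimal deterministic Frank-Wolfe sketch over an l1 ball (a simplification, not the paper's stochastic algorithm); the helper name `frank_wolfe_l1` and the toy objective are assumptions for illustration.

```python
import numpy as np

def frank_wolfe_l1(grad, x0, radius, max_iters, tol):
    """Deterministic Frank-Wolfe (conditional gradient) over an l1 ball.

    The linear minimization oracle over the l1 ball returns a signed vertex
    along the largest-magnitude gradient coordinate; the Frank-Wolfe gap
    <grad, x - s> upper-bounds the suboptimality and serves as a stopping
    criterion, mirroring the role of the stochastic gap estimator above.
    """
    x = x0.copy()
    gap = np.inf
    for k in range(max_iters):
        g = grad(x)
        i = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(g[i])     # vertex minimizing <s, g>
        gap = g @ (x - s)                  # Frank-Wolfe gap
        if gap <= tol:
            break
        x = x + 2.0 / (k + 2.0) * (s - x)  # classical step-size 2/(k+2)
    return x, gap

# f(x) = 0.5 ||x - b||^2 with b strictly inside the l1 ball, so the
# constrained minimizer is b itself.
b = np.array([0.3, -0.2])
x, gap = frank_wolfe_l1(lambda x: x - b, np.zeros(2), radius=1.0,
                        max_iters=500, tol=1e-8)
print(x, gap)
```

Note the step-size schedule 2/(k+2) requires no tuning, which is one of the practical points the abstract emphasizes for the stochastic variant as well.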
Prox-DBRO-VR: A Unified Analysis on Decentralized Byzantine-Resilient Composite Stochastic Optimization with Variance Reduction and Non-Asymptotic Convergence Rates
Decentralized Byzantine-resilient stochastic gradient algorithms efficiently
solve large-scale optimization problems under adverse conditions such as
malfunctioning agents, software bugs, and cyber attacks. This paper addresses
a class of generic composite optimization problems over multi-agent
cyber-physical systems (CPSs) in the presence of an unknown number of
Byzantine agents. Based on the proximal mapping method, two variance-reduced
(VR) techniques, and a norm-penalized approximation strategy, we propose a
decentralized Byzantine-resilient and proximal-gradient algorithmic framework,
dubbed Prox-DBRO-VR, which achieves an optimization and control goal using only
local computations and communications. To asymptotically reduce the variance
introduced by the noisy stochastic gradient evaluations, we incorporate two
localized variance-reduction techniques (SAGA and LSVRG) into Prox-DBRO-VR,
yielding Prox-DBRO-SAGA and Prox-DBRO-LSVRG. By analyzing the contraction
relationships among the gradient-learning error, the robust consensus
condition, and the optimality gap, we show that both Prox-DBRO-SAGA
and Prox-DBRO-LSVRG, with a well-designed constant (resp., decaying) step-size,
converge linearly (resp., sub-linearly) inside an error ball around the optimal
solution to the optimization problem under standard assumptions. The trade-offs
between the convergence accuracy and the number of Byzantine agents in both
linear and sub-linear cases are characterized. Simulations demonstrate the
effectiveness and practicality of the proposed algorithms by solving a sparse
machine-learning problem over multi-agent CPSs under various Byzantine
attacks.

Comment: 14 pages, 0 figures
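The proximal mapping at the core of composite methods like this one has a simple closed form for the l1 penalty. The sketch below shows that building block and a plain (centralized, non-Byzantine) proximal-gradient iteration, not the decentralized Prox-DBRO-VR framework itself; the names `soft_threshold` and `prox_grad_step` are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1 (soft-thresholding), the building
    block of proximal-gradient methods for smooth + l1 composite objectives."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_step(x, grad, step, lam):
    """One proximal-gradient step for minimizing f(x) + lam * ||x||_1."""
    return soft_threshold(x - step * grad(x), step * lam)

# Sparse least squares: f(x) = 0.5 ||x - b||^2 + lam ||x||_1 has the
# closed-form minimizer soft_threshold(b, lam); iterating recovers it.
b = np.array([1.0, 0.05, -0.8])
lam = 0.1
x = np.zeros_like(b)
for _ in range(100):
    x = prox_grad_step(x, lambda x: x - b, step=0.5, lam=lam)
print(x)  # approaches soft_threshold(b, 0.1) = [0.9, 0.0, -0.7]
```

The second coordinate is driven exactly to zero, which is the sparsity-inducing behavior that makes the proximal mapping attractive for the sparse machine-learning problems mentioned in the abstract.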
Minimizing Finite Sums with the Stochastic Average Gradient
We propose the stochastic average gradient (SAG) method for optimizing the
sum of a finite number of smooth convex functions. Like stochastic gradient
(SG) methods, the SAG method's iteration cost is independent of the number of
terms in the sum. However, by incorporating a memory of previous gradient
values the SAG method achieves a faster convergence rate than black-box SG
methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in
general, and when the sum is strongly-convex the convergence rate is improved
from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for
p < 1. Further, in many cases the convergence rate of the new method
is also faster than black-box deterministic gradient methods, in terms of the
number of gradient evaluations. Numerical experiments indicate that the new
algorithm often dramatically outperforms existing SG and deterministic gradient
methods, and that the performance may be further improved through the use of
non-uniform sampling strategies.

Comment: Revision from January 2015 submission. Major changes: updated
literature review and discussion of subsequent work, an additional lemma
showing the validity of one of the formulas, a somewhat simplified
presentation of the Lyapunov bound, code needed for checking proofs included
rather than the polynomials generated by the code, and error regions added to
the numerical experiments.
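The memory-of-past-gradients idea can be sketched in a few lines. The following is a minimal SAG implementation on a toy strongly convex sum (the function name `sag` and the toy data are illustrative assumptions, and the step-size follows the 1/(16L) choice from the analysis):

```python
import numpy as np

def sag(grads, x0, step, iters, rng):
    """Stochastic average gradient (SAG) sketch: keep the most recent
    gradient seen for each term and step along the average of the stored
    gradients; the running sum is maintained in O(d) per iteration."""
    n = len(grads)
    x = x0.copy()
    memory = np.zeros((n, x0.size))   # last gradient evaluated for each f_i
    g_sum = memory.sum(axis=0)
    for _ in range(iters):
        i = rng.integers(n)
        g_new = grads[i](x)
        g_sum += g_new - memory[i]    # refresh one slot of the memory
        memory[i] = g_new
        x = x - (step / n) * g_sum
    return x

# Strongly convex toy sum: f_i(x) = 0.5 ||x - b_i||^2, minimizer = mean(b_i).
rng = np.random.default_rng(1)
bs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
grads = [lambda x, bi=bi: x - bi for bi in bs]
x = sag(grads, np.zeros(2), step=1.0 / 16.0, iters=5000, rng=rng)
print(x)  # approaches the mean of the b_i
```

Like plain SG, each iteration evaluates only one of the n gradients; the stored memory is what lifts the rate from sub-linear to linear in the strongly convex case.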
Hybrid Deterministic-Stochastic Methods for Data Fitting
Many structured data-fitting applications require the solution of an
optimization problem involving a sum over a potentially large number of
measurements. Incremental gradient algorithms offer inexpensive iterations by
sampling a subset of the terms in the sum. These methods can make great
progress initially, but often slow down as they approach a solution. In contrast,
full-gradient methods achieve steady convergence at the expense of evaluating
the full objective and gradient on each iteration. We explore hybrid methods
that exhibit the benefits of both approaches. Rate-of-convergence analysis
shows that by controlling the sample size in an incremental gradient algorithm,
it is possible to maintain the steady convergence rates of full-gradient
methods. We detail a practical quasi-Newton implementation based on this
approach. Numerical experiments illustrate its potential benefits.

Comment: 26 pages. Revised proofs of Theorems 2.6 and 3.1; results unchanged.
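The sample-size control idea can be sketched as follows: start cheap with a tiny batch and grow it geometrically so later iterations behave like full-gradient steps. This is an illustrative simplification (no quasi-Newton component), and the name `growing_batch_gd` and growth factor are assumptions.

```python
import numpy as np

def growing_batch_gd(grads, x0, step, rounds, growth, rng):
    """Hybrid deterministic-stochastic sketch: sample a subset of the terms,
    step along the subsampled gradient, and grow the batch geometrically so
    the method transitions from incremental to full-gradient behavior."""
    n = len(grads)
    x = x0.copy()
    batch = 1
    for _ in range(rounds):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        g = sum(grads[i](x) for i in idx) / len(idx)  # subsampled gradient
        x = x - step * g
        batch = int(np.ceil(batch * growth))          # grow the sample size
    return x

# Toy sum: f_i(x) = 0.5 ||x - b_i||^2, so the minimizer is the mean of b_i.
rng = np.random.default_rng(2)
bs = [np.array([float(i), 1.0]) for i in range(8)]
grads = [lambda x, bi=bi: x - bi for bi in bs]
x = growing_batch_gd(grads, np.zeros(2), step=0.5, rounds=60,
                     growth=1.3, rng=rng)
print(x)  # near the mean [3.5, 1.0] once the batch covers the full sum
```

Once the batch reaches the full dataset, the iteration is exactly full-gradient descent, so the steady convergence rate is preserved while the early iterations stay inexpensive, which is the trade-off the abstract describes.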
Semistochastic Quadratic Bound Methods
Partition functions arise in a variety of settings, including conditional
random fields, logistic regression, and latent Gaussian models. In this paper,
we consider semistochastic quadratic bound (SQB) methods for maximum likelihood
inference based on partition function optimization. Batch methods based on the
quadratic bound were recently proposed for this class of problems, and
performed favorably in comparison to state-of-the-art techniques.
Semistochastic methods fall in between batch algorithms, which use all the
data, and stochastic gradient type methods, which use small random selections
at each iteration. We build semistochastic quadratic bound-based methods, and
prove both global convergence (to a stationary point) under very weak
assumptions, and linear convergence rate under stronger assumptions on the
objective. To make the proposed methods faster and more stable, we consider
inexact subproblem minimization and batch-size selection schemes. The efficacy
of SQB methods is demonstrated via comparison with several state-of-the-art
techniques on commonly used datasets.

Comment: 11 pages, 1 figure
FROST -- Fast row-stochastic optimization with uncoordinated step-sizes
In this paper, we discuss distributed optimization over directed graphs,
where doubly-stochastic weights cannot be constructed. Most of the existing
algorithms overcome this issue by applying push-sum consensus, which utilizes
column-stochastic weights. The formulation of column-stochastic weights
requires each agent to know (at least) its out-degree, which may be
impractical in, e.g., broadcast-based communication protocols. In contrast, we
describe
FROST (Fast Row-stochastic-Optimization with uncoordinated STep-sizes), an
optimization algorithm applicable to directed graphs that does not require
knowledge of out-degrees; its implementation is straightforward, as
each agent locally assigns weights to the incoming information and locally
chooses a suitable step-size. We show that FROST converges linearly to the
optimal solution for smooth and strongly-convex functions given that the
largest step-size is positive and sufficiently small.

Comment: Submitted for journal publication, currently under review.
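To illustrate the two ingredients named in the title, here is a heavily simplified decentralized gradient sketch with a row-stochastic mixing matrix and per-agent (uncoordinated) step-sizes. It omits FROST's gradient tracking and eigenvector correction, so it only reaches approximate consensus near a weighted average rather than the exact optimum; the function name and the toy network are assumptions.

```python
import numpy as np

def row_stochastic_dgd(A, grads, x0, steps, iters):
    """Simplified decentralized gradient iteration: each agent mixes the
    incoming states with its own (row-stochastic) weights, then takes a
    local gradient step with its own step-size. Not the full FROST
    algorithm, only an illustration of row-stochastic mixing."""
    X = x0.copy()                       # one row per agent
    for _ in range(iters):
        G = np.stack([g(X[i]) for i, g in enumerate(grads)])
        X = A @ X - steps[:, None] * G  # mix, then local gradient step
    return X

# Three agents with f_i(x) = 0.5 (x - b_i)^2; each row of A sums to 1,
# i.e. each agent only weights its incoming information.
A = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.4, 0.6]])
grads = [lambda x, bi=bi: x - bi for bi in np.array([0.0, 1.0, 2.0])]
steps = np.array([0.05, 0.04, 0.06])    # uncoordinated, locally chosen
X = row_stochastic_dgd(A, grads, np.zeros((3, 1)), steps, iters=2000)
print(X.ravel())  # agents cluster near a weighted average of the b_i
```

Assigning weights only to incoming information is exactly what removes the out-degree requirement: no agent needs to know how many neighbors receive its broadcasts.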