226 research outputs found
A Modern Introduction to Online Learning
In this monograph, I introduce the basic concepts of Online Learning through
a modern view of Online Convex Optimization. Here, online learning refers to
the framework of regret minimization under worst-case assumptions. I present
first-order and second-order algorithms for online learning with convex losses,
in Euclidean and non-Euclidean settings. All the algorithms are clearly
presented as instantiation of Online Mirror Descent or
Follow-The-Regularized-Leader and their variants. Particular attention is given
to the issue of tuning the parameters of the algorithms and learning in
unbounded domains, through adaptive and parameter-free online learning
algorithms. Non-convex losses are dealt through convex surrogate losses and
through randomization. The bandit setting is also briefly discussed, touching
on the problem of adversarial and stochastic multi-armed bandits. These notes
do not require prior knowledge of convex analysis and all the required
mathematical tools are rigorously explained. Moreover, all the proofs have been
carefully chosen to be as simple and as short as possible.Comment: Fixed more typos, added more history bits, added local norms bounds
for OMD and FTR
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Stochastic gradient descent algorithms for training linear and kernel
predictors are gaining more and more importance, thanks to their scalability.
While various methods have been proposed to speed up their convergence, the
model selection phase is often ignored. In fact, in theoretical works most of
the time assumptions are made, for example, on the prior knowledge of the norm
of the optimal solution, while in the practical world validation methods remain
the only viable approach. In this paper, we propose a new kernel-based
stochastic gradient descent algorithm that performs model selection while
training, with no parameters to tune, nor any form of cross-validation. The
algorithm builds on recent advancement in online learning theory for
unconstrained settings, to estimate over time the right regularization in a
data-dependent way. Optimal rates of convergence are proved under standard
smoothness assumptions on the target function, using the range space of the
fractional integral operator associated with the kernel
Momentum-based variance reduction in non-convex SGD
Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent
in non-convex problems, providing the first algorithms to improve upon the converge rate of stochastic
gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large “mega-batches” in order to achieve their improved results. We present a new algorithm, Storm, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses F, Storm finds a point x with E[k∇F(x)k] ≤ O(1 /√ T + σ^1/3 /T^1/3) in T iterations with σ^2 variance in the gradients, matching the optimal rate and without requiring knowledge of σ.https://arxiv.org/pdf/1905.10018.pdfPublished versio
Momentum-Based Variance Reduction in Non-Convex SGD
Variance reduction has emerged in recent years as a strong competitor to
stochastic gradient descent in non-convex problems, providing the first
algorithms to improve upon the converge rate of stochastic gradient descent for
finding first-order critical points. However, variance reduction techniques
typically require carefully tuned learning rates and willingness to use
excessively large "mega-batches" in order to achieve their improved results. We
present a new algorithm, STORM, that does not require any batches and makes use
of adaptive learning rates, enabling simpler implementation and less
hyperparameter tuning. Our technique for removing the batches uses a variant of
momentum to achieve variance reduction in non-convex optimization. On smooth
losses , STORM finds a point with in iterations
with variance in the gradients, matching the optimal rate but
without requiring knowledge of .Comment: Added Ac
Training Deep Networks without Learning Rates Through Coin Betting
Deep learning methods achieve state-of-the-art performance in many application scenarios. Yet, these methods require a significant amount of hyperparameters tuning in order to achieve the best results. In particular, tuning the learning rates in the stochastic optimization process is still one of the main bottlenecks. In this paper, we propose a new stochastic gradient descent procedure for deep networks that does not require any learning rate setting. Contrary to previous methods, we do not adapt the learning rates nor we make use of the assumed curvature of the objective function. Instead, we reduce the optimization process to a game of betting on a coin and propose a learning rate free optimal algorithm for this scenario. Theoretical convergence is proven for convex and quasi-convex functions and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms
Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations
We study algorithms for online linear optimization in Hilbert spaces,
focusing on the case where the player is unconstrained. We develop a novel
characterization of a large class of minimax algorithms, recovering, and even
improving, several previous results as immediate corollaries. Moreover, using
our tools, we develop an algorithm that provides a regret bound of
, where is
the norm of an arbitrary comparator and both and are unknown to
the player. This bound is optimal up to terms. When is
known, we derive an algorithm with an optimal regret bound (up to constant
factors). For both the known and unknown case, a Normal approximation to
the conditional value of the game proves to be the key analysis tool.Comment: Proceedings of the 27th Annual Conference on Learning Theory (COLT
2014
Parameter-free locally differentially private stochastic subgradient descent
https://arxiv.org/pdf/1911.09564.pdfhttps://arxiv.org/pdf/1911.09564.pdfhttps://arxiv.org/pdf/1911.09564.pdfhttps://arxiv.org/pdf/1911.09564.pdfhttps://arxiv.org/pdf/1911.09564.pdfhttps://arxiv.org/pdf/1911.09564.pdfPublished versio
- …