A new regret analysis for Adam-type algorithms
In this paper, we focus on a theory-practice gap for Adam and its variants (AMSGrad, AdamNC, etc.). In practice, these algorithms are used with a constant first-order moment parameter β1 (typically between 0.9 and 0.99). In theory, regret guarantees for online convex optimization require a rapidly decaying schedule β1 → 0. We show that this is an artifact of the standard analysis, and we propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant β1, without further assumptions. We also demonstrate the flexibility of our analysis on a wide range of algorithms and settings.
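For reference, the update the abstract discusses is the standard Adam step with a constant first-moment parameter β1; the following is a minimal sketch (not the paper's analysis framework), with illustrative default hyperparameters:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. beta1 is held constant, as is typical in practice."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate (constant beta1)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Running this on a simple quadratic shows the iterate moving toward the minimizer at a rate of roughly `lr` per step early on.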
Variants of RMSProp and Adagrad with Logarithmic Regret Bounds
Adaptive gradient methods have recently become very popular, in particular as they have been shown to be useful in the training of deep neural networks. In this paper we analyze RMSProp, originally proposed for the training of deep neural networks, in the context of online convex optimization and show √T-type regret bounds. Moreover, we propose two variants, SC-Adagrad and SC-RMSProp, for which we show logarithmic regret bounds for strongly convex functions. Finally, we demonstrate in experiments that these new variants outperform other adaptive gradient techniques and stochastic gradient descent in the optimization of strongly convex functions as well as in the training of deep neural networks.
Comment: ICML 2017, 16 pages, 23 figures
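The base algorithm analyzed above is RMSProp, which scales each step by a running average of squared gradients; a minimal sketch follows (the SC-Adagrad/SC-RMSProp variants modify the decay and step-size schedules, which are not reproduced here):

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update: divide the step by the root of a running
    average of squared gradients, giving a per-coordinate step size."""
    v = beta * v + (1 - beta) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v
```

On a strongly convex quadratic the iterate contracts toward the minimizer, consistent with the regret bounds discussed in the abstract.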
Dual Averaging Method for Online Graph-structured Sparsity
Online learning algorithms update models with one sample per iteration, and are thus efficient for processing large-scale datasets and useful for detecting malicious events, such as disease outbreaks and traffic congestion, on the fly. However, existing algorithms for graph-structured models focus on the offline setting and the least-squares loss and are not applicable to the online setting, while methods designed for the online setting cannot be directly applied to complex (usually non-convex) graph-structured sparsity models. To address these limitations, in this paper we propose a new algorithm for graph-structured sparsity constraint problems in the online setting, which we call \textsc{GraphDA}. The key part of \textsc{GraphDA} is to project both the averaged gradient (in the dual space) and the primal variables (in the primal space) onto lower-dimensional subspaces, thus capturing the graph-structured sparsity effectively. Furthermore, the objective functions assumed here are general convex functions, so the method can handle different losses in online learning settings. To the best of our knowledge, \textsc{GraphDA} is the first online learning algorithm for graph-structure constrained optimization problems. To validate our method, we conduct extensive experiments on both benchmark graphs and real-world graph datasets. Our experimental results show that, compared to other baseline methods, \textsc{GraphDA} not only improves classification performance but also captures graph-structured features more effectively, hence offering stronger interpretability.
Comment: 11 pages, 14 figures
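The double projection idea can be illustrated with plain dual averaging: average the gradients in the dual, project, take the dual-averaging primal step, and project again. The sketch below is not the paper's method; it substitutes a simple top-k hard-thresholding projection for the graph-structured projection, and the function names and the regularization parameter `gamma` are illustrative:

```python
import numpy as np

def topk_project(x, k):
    """Keep the k largest-magnitude entries of x, zeroing the rest
    (a stand-in for the paper's graph-structured projection)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def dual_averaging_step(g_avg, grad, t, k, gamma=1.0):
    """One dual-averaging step (t starts at 0): project the averaged
    gradient in the dual space, then project the primal iterate."""
    g_avg = (t * g_avg + grad) / (t + 1)            # running average of gradients
    g_proj = topk_project(g_avg, k)                 # projection in the dual space
    theta = -(np.sqrt(t + 1) / gamma) * g_proj      # standard dual-averaging primal map
    theta = topk_project(theta, k)                  # projection in the primal space
    return theta, g_avg
```

By construction, every primal iterate is k-sparse, which is what makes the support (here top-k, in the paper graph-structured) directly interpretable.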