Training Deep Networks without Learning Rates Through Coin Betting
Deep learning methods achieve state-of-the-art performance in many application scenarios. Yet, these methods require a significant amount of hyperparameter tuning in order to achieve the best results. In particular, tuning the learning rates in the stochastic optimization process is still one of the main bottlenecks. In this paper, we propose a new stochastic gradient descent procedure for deep networks that does not require any learning rate setting. Contrary to previous methods, we do not adapt the learning rates, nor do we make use of the assumed curvature of the objective function. Instead, we reduce the optimization process to a game of betting on a coin and propose a learning-rate-free optimal algorithm for this scenario. Theoretical convergence is proven for convex and quasi-convex functions, and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms.
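The betting reduction can be illustrated with a minimal, self-contained sketch: a KT-style coin-betting strategy applied to a one-dimensional convex problem with bounded subgradients. This shows only the generic betting-to-optimization reduction; it is not the COCOB algorithm the paper derives for deep networks, and the test function, step count, and initial wealth are assumptions made for the demo.

```python
# Minimal sketch of the generic coin-betting reduction (KT-style bettor, 1-D convex
# problem with subgradients in [-1, 1]). Illustrative only: NOT the paper's COCOB
# algorithm for deep networks; test function and constants are demo assumptions.
import numpy as np

def coin_betting_minimize(grad, w0=0.0, steps=2000, initial_wealth=1.0):
    """Learning-rate-free 1-D optimization via coin betting."""
    wealth = initial_wealth
    coin_sum = 0.0                 # sum of past "coin outcomes" (negative subgradients)
    avg = w0
    for t in range(1, steps + 1):
        beta = coin_sum / t        # KT betting fraction, always in (-1, 1)
        w = w0 + beta * wealth     # the bet is the offset from the starting point
        coin = -grad(w)            # coin outcome: negative subgradient, in [-1, 1]
        wealth += coin * (w - w0)  # wealth grows when the bet agrees with the coin
        coin_sum += coin
        avg += (w - avg) / t       # running average of the iterates
    return avg

# Example: minimize f(w) = |w - 3|, whose subgradients lie in {-1, 0, 1}.
print(coin_betting_minimize(lambda w: float(np.sign(w - 3.0))))
```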
Adaptive Bound Optimization for Online Convex Optimization
We introduce a new online convex optimization algorithm that adaptively
chooses its regularization function based on the loss functions observed so
far. This is in contrast to previous algorithms that use a fixed regularization
function such as L2-squared, and modify it only via a single time-dependent
parameter. Our algorithm's regret bounds are worst-case optimal, and for
certain realistic classes of loss functions they are much better than existing
bounds. These bounds are problem-dependent, which means they can exploit the
structure of the actual problem instance. Critically, however, our algorithm
does not need to know this structure in advance. Rather, we prove competitive
guarantees that show the algorithm provides a bound within a constant factor of
the best possible bound (of a certain functional form) in hindsight.
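In modern terms, this kind of adaptive regularization is closely related to per-coordinate, data-dependent step sizes of the AdaGrad family. The sketch below is a simplified illustration of that idea only, not the paper's exact algorithm or its competitive analysis; the quadratic test problem and scaling constants are assumptions made for the demo.

```python
# Simplified illustration of data-dependent adaptation: each coordinate's effective
# step size is built from the squared gradients observed so far (an AdaGrad-style
# diagonal rule). Not the paper's exact algorithm; the test problem is an assumption.
import numpy as np

def diagonal_adaptive_gd(grad, w0, steps=1000, eta=1.0, eps=1e-12):
    w = np.array(w0, dtype=float)
    g_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        g_sq += g ** 2                        # accumulate observed gradient magnitudes
        w -= eta * g / (np.sqrt(g_sq) + eps)  # per-coordinate, data-dependent step
    return w

# A badly scaled quadratic: per-coordinate adaptation absorbs the scale mismatch
# that a single fixed regularizer or global learning rate would have to trade off.
target = np.array([1.0, -2.0, 3.0])
scales = np.array([100.0, 1.0, 0.01])
print(diagonal_adaptive_gd(lambda w: scales * (w - target), w0=np.zeros(3)))
```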
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Stochastic gradient descent algorithms for training linear and kernel
predictors are gaining more and more importance, thanks to their scalability.
While various methods have been proposed to speed up their convergence, the
model selection phase is often ignored. In fact, theoretical works typically
rely on assumptions such as prior knowledge of the norm of the optimal
solution, while in practice validation methods remain
the only viable approach. In this paper, we propose a new kernel-based
stochastic gradient descent algorithm that performs model selection while
training, with no parameters to tune, nor any form of cross-validation. The
algorithm builds on recent advances in online learning theory for
unconstrained settings to estimate the right amount of regularization over time in a
data-dependent way. Optimal rates of convergence are proved under standard
smoothness assumptions on the target function, using the range space of the
fractional integral operator associated with the kernel.
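For contrast, the workflow this removes looks like the sketch below: choosing the regularization strength of a kernel predictor by grid search with cross-validation. The dataset, kernel, and parameter grid are hypothetical; the point is only that this tuning loop is what the proposed parameter-free algorithm is designed to make unnecessary.

```python
# Baseline the paper contrasts against: cross-validating the regularization strength
# of a kernel predictor. Toy data, kernel, and grid are hypothetical; this is not the
# paper's algorithm, only the tuning loop it avoids.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

search = GridSearchCV(
    KernelRidge(kernel="rbf", gamma=0.5),
    param_grid={"alpha": np.logspace(-4, 1, 6)},  # the knob being tuned by validation
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```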
Learning-Rate-Free Learning by D-Adaptation
The speed of gradient descent for convex Lipschitz functions is highly
dependent on the choice of learning rate. Setting the learning rate to achieve
the optimal convergence rate requires knowing the distance D from the initial
point to the solution set. In this work, we describe a single-loop method, with
no back-tracking or line searches, which does not require knowledge of D yet
asymptotically achieves the optimal rate of convergence for the complexity
class of convex Lipschitz functions. Our approach is the first parameter-free
method for this class without additional multiplicative log factors in the
convergence rate. We present extensive experiments for SGD and Adam variants of
our method, where the method automatically matches hand-tuned learning rates
across more than a dozen diverse machine learning problems, including
large-scale vision and language problems. Our method is practical, efficient
and requires no additional function value or gradient evaluations each step. An
open-source implementation is available
(https://github.com/facebookresearch/dadaptation).
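A hedged usage sketch of the released package as a drop-in replacement for a hand-tuned PyTorch optimizer: the class name DAdaptAdam, the lr=1.0 convention, and the toy model and data below are assumptions, so consult the repository's README for the actual API.

```python
# Hedged usage sketch: dropping a D-Adaptation optimizer into a standard PyTorch loop.
# The class name (DAdaptAdam) and the lr=1.0 convention are assumptions about the
# package above; the model and data are toys. Check the repository README for the API.
import torch
import dadaptation

model = torch.nn.Linear(10, 1)
# lr is left at 1.0: the optimizer estimates the distance-to-solution scale D itself,
# so no learning-rate sweep is needed.
optimizer = dadaptation.DAdaptAdam(model.parameters(), lr=1.0)

data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]
for x, y in data:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```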