Training Deep Networks without Learning Rates Through Coin Betting
Deep learning methods achieve state-of-the-art performance in many application scenarios. Yet, these methods require a significant amount of hyperparameter tuning in order to achieve the best results. In particular, tuning the learning rates in the stochastic optimization process is still one of the main bottlenecks. In this paper, we propose a new stochastic gradient descent procedure for deep networks that does not require any learning rate setting. Contrary to previous methods, we neither adapt the learning rates nor make use of the assumed curvature of the objective function. Instead, we reduce the optimization process to a game of betting on a coin and propose a learning-rate-free optimal algorithm for this scenario. Theoretical convergence is proven for convex and quasi-convex functions, and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms.
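For readers who want to see the reduction concretely, below is a minimal sketch of stochastic optimization via coin betting using a Krichevsky-Trofimov (KT) style bettor. It is an illustrative assumption rather than the paper's exact COCOB algorithm: the function name kt_coin_betting_sgd, the per-coordinate wealth bookkeeping, and the unit gradient bound are choices made here for brevity.

    import numpy as np

    def kt_coin_betting_sgd(grad, x0, n_steps, eps=1.0):
        # Minimal coin-betting sketch (KT-style bettor), applied per coordinate.
        # Assumes each gradient coordinate lies in [-1, 1]; rescale otherwise.
        x0 = np.asarray(x0, dtype=float)
        wealth = np.full_like(x0, eps)    # initial "money" to bet, per coordinate
        sum_neg_g = np.zeros_like(x0)     # running sum of coin outcomes -g_i
        avg = np.array(x0)                # running average of iterates (the output)
        for t in range(1, n_steps + 1):
            beta = sum_neg_g / t          # KT betting fraction -- no learning rate
            bet = beta * wealth           # amount wagered this round
            x = x0 + bet
            g = grad(x)                   # stochastic gradient at the current bet
            wealth += -g * bet            # win or lose according to the "coin" -g
            sum_neg_g += -g
            avg += (x - avg) / t          # for convex objectives, report the average
        return avg

As a quick check, kt_coin_betting_sgd(lambda x: np.sign(x - 3.0), x0=[0.0], n_steps=10000) approaches the minimizer 3.0 of |x - 3| without any step-size tuning; the KT analysis assumes the per-round "coin outcomes" stay in [-1, 1], hence the sign gradient here.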
Lipschitz Adaptivity with Multiple Learning Rates in Online Learning
We aim to design adaptive online learning algorithms that take advantage of
any special structure that might be present in the learning task at hand, with
as little manual tuning by the user as possible. A fundamental obstacle that
comes up in the design of such adaptive algorithms is to calibrate a so-called
step-size or learning rate hyperparameter depending on variance, gradient
norms, etc. A recent technique promises to overcome this difficulty by
maintaining multiple learning rates in parallel. This technique has been
applied in the MetaGrad algorithm for online convex optimization and the Squint
algorithm for prediction with expert advice. However, in both cases the user
still has to provide in advance a Lipschitz hyperparameter that bounds the norm
of the gradients. Although this hyperparameter is typically not available in
advance, tuning it correctly is crucial: if it is set too small, the methods
may fail completely; but if it is taken too large, performance deteriorates
significantly. In the present work we remove this Lipschitz hyperparameter by
designing new versions of MetaGrad and Squint that adapt to its optimal value
automatically. We achieve this by dynamically updating the set of active
learning rates. For MetaGrad, we further improve the computational efficiency
of handling constraints on the domain of prediction, and we remove the need to
specify the number of rounds in advance. Comment: 22 pages. To appear in COLT 2019.
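A much-simplified sketch of the two ideas in this abstract, keeping several learning rates active in parallel and growing that set when the observed gradient scale exceeds the current Lipschitz estimate, is given below. It is an illustrative assumption only: the function parallel_learning_rates, the grad(x, t) oracle, the doubling grid, and the exponential-weights combination are stand-ins, not the actual MetaGrad or Squint updates.

    import numpy as np

    def parallel_learning_rates(grad, dim, T, radius=1.0):
        # Illustrative only: one projected online-gradient-descent copy per step
        # size in a small grid, combined by exponential weights on linearized
        # losses; the grid is extended whenever the observed gradient norm
        # exceeds the current Lipschitz estimate.
        lip = 1.0                                      # current Lipschitz guess
        etas = radius / (lip * np.sqrt(T)) * 2.0 ** np.arange(5)  # grid of step sizes
        xs = np.zeros((len(etas), dim))                # one iterate per step size
        loss = np.zeros(len(etas))                     # cumulative linearized loss
        for t in range(1, T + 1):
            w = np.exp(-(loss - loss.min()) / (lip * radius * np.sqrt(T)))
            w /= w.sum()
            x = w @ xs                                 # combined prediction
            g = grad(x, t)
            if np.linalg.norm(g) > lip:                # Lipschitz guess too small:
                lip = np.linalg.norm(g)                # grow it and add a matching,
                etas = np.append(etas, radius / (lip * np.sqrt(T)))  # smaller step size
                xs = np.vstack([xs, x[None, :]])
                loss = np.append(loss, loss.mean())
            loss += xs @ g - x @ g                     # linearized loss vs. the mixture
            xs = xs - etas[:, None] * g                # parallel gradient steps
            norms = np.linalg.norm(xs, axis=1, keepdims=True)
            xs = np.where(norms > radius, xs * radius / np.maximum(norms, 1e-12), xs)
        return x

The doubling grid of step sizes plays the role of the multiple learning rates in MetaGrad and Squint, and extending the grid when the gradient norm jumps mimics the Lipschitz adaptation; the actual algorithms implement both inside carefully designed surrogate losses rather than as plain gradient-descent copies.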
Better Parameter-free Stochastic Optimization with ODE Updates for Coin-Betting
Parameter-free stochastic gradient descent (PFSGD) algorithms do not require
setting learning rates while achieving optimal theoretical performance. In
practical applications, however, there remains an empirical gap between tuned
stochastic gradient descent (SGD) and PFSGD. In this paper, we close the
empirical gap with a new parameter-free algorithm based on continuous-time
Coin-Betting on truncated models. The new update is derived through the
solution of an Ordinary Differential Equation (ODE) and solved in closed
form. We show empirically that this new parameter-free algorithm outperforms
algorithms with the "best default" learning rates and almost matches the
performance of finely tuned baselines without anything to tune.
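For context on the "truncated models" mentioned here, the construction usually meant by that phrase (an assumption on my part; the abstract does not define it) lower-bounds the usual first-order model of the loss by a value the loss is known never to go below:

    \tilde{f}_t(x) \;=\; \max\big( f_t(x_t) + \langle g_t,\, x - x_t \rangle,\ \hat{\ell}_t \big),
    \qquad g_t \in \partial f_t(x_t),

where \hat{\ell}_t is a known lower bound on f_t (for example 0 for non-negative losses). Intuitively, betting against such a surrogate cannot be pushed into overly large steps by an unbounded linear model.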
An Improved Relaxation for Oracle-Efficient Adversarial Contextual Bandits
We present an oracle-efficient relaxation for the adversarial contextual
bandits problem, where the contexts are sequentially drawn i.i.d. from a known
distribution and the cost sequence is chosen by an online adversary. Our
algorithm has a regret bound of $O\big(T^{2/3}(K \log |\Pi|)^{1/3}\big)$ and makes
at most $O(K)$ calls per round to an offline optimization oracle, where $K$
denotes the number of actions, $T$ denotes the number of rounds and $\Pi$
denotes the set of policies. This is the first result to improve the prior best
bound of $O\big((TK)^{2/3}(\log |\Pi|)^{1/3}\big)$ as obtained by Syrgkanis et
al. at NeurIPS 2016, and the first to match the original bound of Langford and
Zhang at NeurIPS 2007, which was obtained for the stochastic case. Comment: Appears in NeurIPS 202
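Assuming the bounds read as above, the improvement over the earlier relaxation is exactly a factor of $K^{1/3}$:

    O\big((TK)^{2/3} (\log |\Pi|)^{1/3}\big)
    \;=\; O\big(K^{1/3} \cdot T^{2/3} (K \log |\Pi|)^{1/3}\big),

so the new bound removes a $K^{1/3}$ factor from the dependence on the number of actions while keeping the $T^{2/3}$ rate.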