
    Shampoo: Preconditioned Stochastic Tensor Optimization

    Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo's runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
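
    The per-dimension preconditioning is easiest to see for a matrix-shaped parameter. The following is a minimal NumPy sketch of a single Shampoo step under that assumption; the names (shampoo_step, inv_fourth_root), the learning rate, and the epsilon regularizer are illustrative choices rather than the paper's reference implementation.

        import numpy as np

        def inv_fourth_root(M, eps=1e-6):
            # Inverse fourth root of a symmetric PSD matrix via eigendecomposition.
            w, V = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
            return (V * w ** -0.25) @ V.T

        def shampoo_step(W, G, L, R, lr=0.1):
            # One preconditioner per dimension of the parameter: L contracts over
            # the columns of the gradient G, R contracts over its rows.
            L += G @ G.T
            R += G.T @ G
            # Precondition the gradient on both sides and take a descent step.
            W -= lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
            return W, L, R

    For an order-k tensor the same idea keeps k such preconditioners, each one contracting the gradient over all remaining modes.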

    Chasing Ghosts: Competing with Stateful Policies

    We consider sequential decision making in a setting where regret is measured with respect to a set of stateful reference policies, and feedback is limited to observing the rewards of the actions performed (the so-called "bandit" setting). If either the reference policies are stateless rather than stateful, or the feedback includes the rewards of all actions (the so-called "expert" setting), previous work shows that the optimal regret grows like $\Theta(\sqrt{T})$ in terms of the number of decision rounds $T$. The difficulty in our setting is that the decision maker unavoidably loses track of the internal states of the reference policies, and thus cannot reliably attribute rewards observed in a certain round to any of the reference policies. In fact, in this setting it is impossible for the algorithm to estimate which policy gives the highest (or even approximately highest) total reward. Nevertheless, we design an algorithm that achieves expected regret that is sublinear in $T$, of the form $O(T/\log^{1/4} T)$. Our algorithm is based on a certain local repetition lemma that may be of independent interest. We also show that no algorithm can guarantee expected regret better than $O(T/\log^{3/2} T)$.
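
    For concreteness, the regret notion being bounded can be written as follows; the notation is a summary convention, not necessarily the paper's exact definition:

        \mathrm{Regret}_T \;=\; \max_{\pi \in \Pi} \, \mathbb{E}\Big[\sum_{t=1}^{T} r_t(\pi)\Big] \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big]

    where $\Pi$ is the set of stateful reference policies, $a_t$ is the action the learner plays at round $t$, and $r_t(\pi)$ is the reward $\pi$ would have collected at round $t$ had it been run from the start. Because $r_t(\pi)$ depends on $\pi$'s entire unobserved state trajectory, bandit feedback on $a_t$ alone does not let the learner estimate it, which is the difficulty described above.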

    Logistic Regression: Tight Bounds for Stochastic and Online Optimization

    The logistic loss function is often advocated in machine learning and statistics as a smooth and strictly convex surrogate for the 0-1 loss. In this paper we investigate the question of whether these smoothness and convexity properties make the logistic loss preferable to other widely considered options such as the hinge loss. We show that in contrast to known asymptotic bounds, as long as the number of prediction/optimization iterations is sub-exponential, the logistic loss provides no improvement over a generic non-smooth loss function such as the hinge loss. In particular, we show that the convergence rate of stochastic logistic optimization is bounded from below by a polynomial in the diameter of the decision set and the number of prediction iterations, and provide a matching tight upper bound. This resolves the COLT open problem of McMahan and Streeter (2012).
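
    For reference, the two surrogates being compared can be written as functions of the margin $z = y\langle w, x\rangle$ (the notation here is chosen for concreteness, not taken from the paper):

        \ell_{\mathrm{logistic}}(z) = \log\big(1 + e^{-z}\big), \qquad \ell_{\mathrm{hinge}}(z) = \max\{0,\, 1 - z\}

    The logistic loss is smooth and strictly convex while the hinge loss is neither, and the lower bound above shows that this distinction does not help at sub-exponential horizons.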

    Memory-Efficient Adaptive Optimization

    Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. However, these methods maintain second-order statistics for each parameter, thus introducing significant memory overheads that restrict the size of the model being used as well as the number of examples in a mini-batch. We describe an effective and flexible adaptive optimization method with greatly reduced memory overhead. Our method retains the benefits of per-parameter adaptivity while allowing significantly larger models and batch sizes. We give convergence guarantees for our method, and demonstrate its effectiveness in training very large translation and language models with up to 2-fold speedups compared to the state-of-the-art.
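
    The memory saving can be illustrated with a row/column-factored accumulator for a matrix-shaped parameter: instead of one second-moment statistic per entry, keep one per row and one per column and combine them per entry. The sketch below is a simplified rendering of that idea (with rows and columns as the cover sets); the names and constants are illustrative, and it omits details such as learning-rate schedules and momentum.

        import numpy as np

        def factored_accumulator_step(W, G, row_acc, col_acc, lr=0.1, eps=1e-8):
            # Adagrad-style step with O(m + n) accumulator memory instead of O(m * n).
            G2 = G ** 2
            row_acc += G2.max(axis=1)   # one statistic per row
            col_acc += G2.max(axis=0)   # one statistic per column
            # Per-entry upper bound on the accumulated squared gradients.
            nu = np.minimum(row_acc[:, None], col_acc[None, :])
            W -= lr * G / (np.sqrt(nu) + eps)
            return W, row_acc, col_acc

    Since nu[i, j] always upper-bounds the sum of past g[i, j]^2, the resulting step sizes are never larger than Adagrad's, while accumulator memory grows with the sum rather than the product of the parameter's dimensions.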

    Online Learning with Low Rank Experts

    We consider the problem of prediction with expert advice when the losses of the experts have low-dimensional structure: they are restricted to an unknown $d$-dimensional subspace. We devise algorithms with regret bounds that are independent of the number of experts and depend only on the rank $d$. For the stochastic model we show a tight bound of $\Theta(\sqrt{dT})$, and extend it to a setting of an approximate $d$-dimensional subspace. For the adversarial model we show an upper bound of $O(d\sqrt{T})$ and a lower bound of $\Omega(\sqrt{dT})$.
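
    The structural assumption can be summarized as follows (the notation is this summary's, not necessarily the paper's): with $N$ experts, the loss vectors $\ell_t \in \mathbb{R}^N$ satisfy

        \ell_t = U c_t, \qquad U \in \mathbb{R}^{N \times d}, \; c_t \in \mathbb{R}^{d}

    for an unknown matrix $U$, and regret is measured against the single best expert in hindsight. The bounds above thus replace the usual $\sqrt{T \log N}$ dependence on the number of experts with a dependence on the rank $d$ alone.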