Shampoo: Preconditioned Stochastic Tensor Optimization
Preconditioned gradient methods are among the most general and powerful tools
in optimization. However, preconditioning requires storing and manipulating
prohibitively large matrices. We describe and analyze a new structure-aware
preconditioning algorithm, called Shampoo, for stochastic optimization over
tensor spaces. Shampoo maintains a set of preconditioning matrices, each of
which operates on a single dimension, contracting over the remaining
dimensions. We establish convergence guarantees in the stochastic convex
setting, the proof of which builds upon matrix trace inequalities. Our
experiments with state-of-the-art deep learning models show that Shampoo is
capable of converging considerably faster than commonly used optimizers.
Although it involves a more complex update rule, Shampoo's runtime per step is
comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
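As a concrete illustration of the per-dimension preconditioners, here is a minimal NumPy sketch of the update for a matrix-shaped parameter; the damping constant, learning rate, and initialization are illustrative assumptions rather than the paper's tuned choices.

```python
import numpy as np

def inv_fourth_root(mat, damping=1e-4):
    # Inverse fourth root of a symmetric PSD matrix via its eigendecomposition.
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.maximum(vals, damping) ** -0.25) @ vecs.T

def shampoo_step(W, grad, L, R, lr=0.1):
    # One step for a matrix parameter W: each preconditioner is built by
    # contracting the gradient over the other dimension, then the gradient
    # is preconditioned on both sides.
    L += grad @ grad.T          # row-dimension statistics
    R += grad.T @ grad          # column-dimension statistics
    W -= lr * inv_fourth_root(L) @ grad @ inv_fourth_root(R)
    return W, L, R
```

Each preconditioner is only as large as one dimension of the parameter squared, which is what keeps the per-step cost close to that of simple gradient methods while avoiding a full-matrix preconditioner.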
Chasing Ghosts: Competing with Stateful Policies
We consider sequential decision making in a setting where regret is measured
with respect to a set of stateful reference policies, and feedback is limited
to observing the rewards of the actions performed (the so-called "bandit"
setting). If either the reference policies are stateless rather than stateful,
or the feedback includes the rewards of all actions (the so-called "expert"
setting), previous work shows that the optimal regret grows like $\sqrt{T}$
in terms of the number of decision rounds $T$.
The difficulty in our setting is that the decision maker unavoidably loses
track of the internal states of the reference policies, and thus cannot
reliably attribute rewards observed in a certain round to any of the reference
policies. In fact, in this setting it is impossible for the algorithm to
estimate which policy gives the highest (or even approximately highest) total
reward. Nevertheless, we design an algorithm that achieves expected regret that
is sublinear in $T$, of the form $T/\mathrm{polylog}(T)$. Our algorithm is based
on a certain local repetition lemma that may be of independent interest. We
also show that no algorithm can guarantee expected regret better than
$T/\mathrm{polylog}(T)$.
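To make the difficulty concrete, the toy sketch below (an assumed finite-state model, not the paper's exact formalism) shows why a learner that deviates from a reference policy loses track of that policy's internal state: the state transition depends on a reward the learner never observes.

```python
import random

class ToyStatefulPolicy:
    # Illustrative finite-state reference policy: the action it plays depends
    # on an internal state, and the state evolves with the reward it receives.
    def __init__(self, num_states, num_actions, seed=0):
        rng = random.Random(seed)
        self.state = 0
        self.action_table = [rng.randrange(num_actions) for _ in range(num_states)]
        self.num_states = num_states

    def act(self):
        return self.action_table[self.state]

    def update(self, reward):
        # If the learner played a different action this round, it never sees
        # this reward, so it cannot simulate the policy's state transition.
        self.state = (self.state + (1 if reward > 0.5 else 2)) % self.num_states
```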
Logistic Regression: Tight Bounds for Stochastic and Online Optimization
The logistic loss function is often advocated in machine learning and
statistics as a smooth and strictly convex surrogate for the 0-1 loss. In this
paper we investigate the question of whether these smoothness and convexity
properties make the logistic loss preferable to other widely considered options
such as the hinge loss. We show that in contrast to known asymptotic bounds, as
long as the number of prediction/optimization iterations is sub-exponential,
the logistic loss provides no improvement over a generic non-smooth loss
function such as the hinge loss. In particular we show that the convergence
rate of stochastic logistic optimization is bounded from below by a polynomial
in the diameter of the decision set and the number of prediction iterations,
and provide a matching tight upper bound. This resolves the COLT open problem
of McMahan and Streeter (2012).
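For reference, the two surrogate losses being compared are the standard margin-based definitions, with $z$ denoting the margin $y\,\langle w, x\rangle$:

$$\ell_{\mathrm{logistic}}(z) = \log\bigl(1 + e^{-z}\bigr), \qquad \ell_{\mathrm{hinge}}(z) = \max\{0,\ 1 - z\}.$$

The logistic loss is smooth and strictly convex in $z$, whereas the hinge loss is convex but neither smooth nor strictly convex.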
Memory-Efficient Adaptive Optimization
Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for
achieving state-of-the-art performance in machine translation and language
modeling. However, these methods maintain second-order statistics for each
parameter, thus introducing significant memory overheads that restrict the size
of the model being used as well as the number of examples in a mini-batch. We
describe an effective and flexible adaptive optimization method with greatly
reduced memory overhead. Our method retains the benefits of per-parameter
adaptivity while allowing significantly larger models and batch sizes. We give
convergence guarantees for our method, and demonstrate its effectiveness in
training very large translation and language models with up to 2-fold speedups
compared to the state of the art.
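The abstract does not spell out the mechanism, but a common way to obtain this kind of memory saving for a matrix-shaped parameter is to share second-moment statistics across rows and columns; the sketch below is an assumption-laden illustration of that idea (names and hyperparameters are invented), not a restatement of the paper's exact algorithm.

```python
import numpy as np

def row_col_adaptive_step(W, grad, row_acc, col_acc, lr=0.1, eps=1e-8):
    # Per-entry second-moment estimate: the tighter (smaller) of the two
    # shared accumulators, refreshed with the current squared gradient.
    v = np.minimum(row_acc[:, None], col_acc[None, :]) + grad ** 2
    W -= lr * grad / (np.sqrt(v) + eps)
    # Fold the per-entry estimates back into O(rows + cols) statistics.
    row_acc[:] = v.max(axis=1)
    col_acc[:] = v.max(axis=0)
    return W
```

Only one accumulator per row and one per column is stored, so the optimizer state grows with the sum of the parameter's dimensions rather than their product.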
Online Learning with Low Rank Experts
We consider the problem of prediction with expert advice when the losses of
the experts have low-dimensional structure: they are restricted to an unknown
$d$-dimensional subspace. We devise algorithms with regret bounds that are
independent of the number of experts and depend only on the rank $d$. For the
stochastic model we show a tight bound of $\Theta(\sqrt{dT})$, and extend it to
a setting of an approximate subspace. For the adversarial model we show an
upper bound of $O(d\sqrt{T})$ and a lower bound of $\Omega(\sqrt{dT})$.
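A small sketch of the loss structure assumed in this setting (the names N, d, T, and U are illustrative): every round's loss vector over the experts lies in an unknown low-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T = 1000, 5, 200            # many experts, low rank, number of rounds
U = rng.standard_normal((N, d))   # unknown basis of the loss subspace

# Each loss vector assigns a loss to all N experts but lives in the
# d-dimensional column space of U, so only d "directions" of expert
# behavior actually matter.
losses = [U @ rng.standard_normal(d) for _ in range(T)]
```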
- …