17 research outputs found
Adaptive Bound Optimization for Online Convex Optimization
We introduce a new online convex optimization algorithm that adaptively
chooses its regularization function based on the loss functions observed so
far. This is in contrast to previous algorithms that use a fixed regularization
function such as L2-squared, and modify it only via a single time-dependent
parameter. Our algorithm's regret bounds are worst-case optimal, and for
certain realistic classes of loss functions they are much better than existing
bounds. These bounds are problem-dependent, which means they can exploit the
structure of the actual problem instance. Critically, however, our algorithm
does not need to know this structure in advance. Rather, we prove competitive
guarantees that show the algorithm provides a bound within a constant factor of
the best possible bound (of a certain functional form) in hindsight.Comment: Updates to match final COLT versio
Lipschitz Adaptivity with Multiple Learning Rates in Online Learning
We aim to design adaptive online learning algorithms that take advantage of
any special structure that might be present in the learning task at hand, with
as little manual tuning by the user as possible. A fundamental obstacle that
comes up in the design of such adaptive algorithms is to calibrate a so-called
step-size or learning rate hyperparameter depending on variance, gradient
norms, etc. A recent technique promises to overcome this difficulty by
maintaining multiple learning rates in parallel. This technique has been
applied in the MetaGrad algorithm for online convex optimization and the Squint
algorithm for prediction with expert advice. However, in both cases the user
still has to provide in advance a Lipschitz hyperparameter that bounds the norm
of the gradients. Although this hyperparameter is typically not available in
advance, tuning it correctly is crucial: if it is set too small, the methods
may fail completely; but if it is taken too large, performance deteriorates
significantly. In the present work we remove this Lipschitz hyperparameter by
designing new versions of MetaGrad and Squint that adapt to its optimal value
automatically. We achieve this by dynamically updating the set of active
learning rates. For MetaGrad, we further improve the computational efficiency
of handling constraints on the domain of prediction, and we remove the need to
specify the number of rounds in advance.Comment: 22 pages. To appear in COLT 201
Lipschitz Adaptivity with Multiple Learning Rates in Online Learning
We aim to design adaptive online learning algorithms that take advantage of any special structure
that might be present in the learning task at hand, with as little manual tuning by the user as possible.
A fundamental obstacle that comes up in the design of such adaptive algorithms is to calibrate
a so-called step-size or learning rate hyperparameter depending on variance, gradient norms, etc.
A recent technique promises to overcome this difficulty by maintaining multiple learning rates in
parallel. This technique has been applied in the MetaGrad algorithm for online convex optimization
and the Squint algorithm for prediction with expert advice. However, in both cases the user still has
to provide in advance a Lipschitz hyperparameter that bounds the norm of the gradients. Although
this hyperparameter is typically not available in advance, tuning it correctly is crucial: if it is set
too small, the methods may fail completely; but if it is taken too large, performance deteriorates
significantly. In the present work we remove this Lipschitz hyperparameter by designing new
versions of MetaGrad and Squint that adapt to its optimal value automatically. We achieve this
by dynamically updating the set of active learning rates. For MetaGrad, we further improve the
computational efficiency of handling constraints on the domain of prediction, and we remove the
need to specify the number of rounds in advance
MetaGrad: Multiple Learning Rates in Online Learning
Analysis and Stochastic
MetaGrad: Adaptation using Multiple Learning Rates in Online Learning
We provide a new adaptive method for online convex optimization, MetaGrad,
that is robust to general convex losses but achieves faster rates for a broad
class of special functions, including exp-concave and strongly convex
functions, but also various types of stochastic and non-stochastic functions
without any curvature. We prove this by drawing a connection to the Bernstein
condition, which is known to imply fast rates in offline statistical learning.
MetaGrad further adapts automatically to the size of the gradients. Its main
feature is that it simultaneously considers multiple learning rates, which are
weighted directly proportional to their empirical performance on the data using
a new meta-algorithm. We provide three versions of MetaGrad. The full matrix
version maintains a full covariance matrix and is applicable to learning tasks
for which we can afford update time quadratic in the dimension. The other two
versions provide speed-ups for high-dimensional learning tasks with an update
time that is linear in the dimension: one is based on sketching, the other on
running a separate copy of the basic algorithm per coordinate. We evaluate all
versions of MetaGrad on benchmark online classification and regression tasks,
on which they consistently outperform both online gradient descent and AdaGrad
Stochastic functional descent for learning Support Vector Machines
We present a novel method for learning Support Vector Machines (SVMs) in the online setting. Our method is generally applicable in that it handles the online learning of the binary, multiclass, and structural SVMs in a unified view.
The SVM learning problem consists of optimizing a convex objective function that is composed of two parts: the hinge loss and quadratic regularization. To date, the predominant family of approaches for online SVM learning has been gradient-based methods, such as Stochastic Gradient Descent (SGD). Unfortunately, we note that there are two drawbacks in such approaches: first, gradient-based methods are based on a local linear approximation to the function being optimized, but since the hinge loss is piecewise-linear and nonsmooth, this approximation can be ill-behaved. Second, existing online SVM learning approaches share the same problem formulation with batch SVM learning methods, and they all need to tune a fixed global regularization parameter by cross validation. On the one hand, global regularization is ineffective in handling local irregularities encountered in the online setting; on the other hand, even though the learning problem for a particular global regularization parameter value may be efficiently solved, repeatedly solving for a wide range of values can be costly.
We intend to tackle these two problems with our approach. To address the first problem, we propose to perform implicit online update steps to optimize the hinge loss, as opposed to explicit (or gradient-based) updates that utilize subgradients to perform local linearization. Regarding the second problem, we propose to enforce local regularization that is applied to individual classifier update steps, rather than having a fixed global regularization term.
Our theoretical analysis suggests that our classifier update steps progressively optimize the structured hinge loss, with the rate controlled by a sequence of regularization parameters; setting these parameters is analogous to setting the stepsizes in gradient-based methods. In addition, we give sufficient conditions for the algorithm's convergence. Experimentally, our online algorithm can match optimal classification performances given by other state-of-the-art online SVM learning methods, as well as batch learning methods, after only one or two passes over the training data. More importantly, our algorithm can attain these results without doing cross validation, while all other methods must perform time-consuming cross validation to determine the optimal choice of the global regularization parameter