Efficient Online Convex Optimization with Adaptively Minimax Optimal Dynamic Regret
We introduce an online convex optimization algorithm using projected
sub-gradient descent with ideal adaptive learning rates, where each computation
is efficiently done in a sequential manner. For the first time in the
literature, this algorithm provides an adaptively minimax optimal dynamic
regret guarantee for a sequence of convex functions without any restrictions --
such as strong convexity, smoothness or even Lipschitz continuity -- against a
comparator decision sequence with bounded total successive changes. We show
optimality by deriving a worst-case adaptive lower bound on the dynamic regret,
which consists of the actual sub-gradient norms and matches our guarantees.
We discuss the advantages of our algorithm as opposed to adaptive projection
with sub-gradient self outer products and also derive the extension for
independent learning in each decision coordinate separately. Additionally, we
demonstrate how to best preserve our guarantees when the bound on total
successive changes in the dynamic comparator sequence grows over time, in a
truly online manner. Comment: 10 pages, 1 figure, preprint, [v0] 201
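As a rough illustration of the general technique named in this abstract, the following is a minimal sketch of online projected sub-gradient descent with a norm-adaptive step size. It is not the paper's algorithm: the Euclidean-ball decision set, the step size radius / sqrt(sum of squared sub-gradient norms), and all function names are assumptions chosen for illustration.

```python
import math

def project_to_ball(x, radius):
    """Euclidean projection of x onto the origin-centered ball of given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def projected_subgradient_descent(subgradient, x0, radius, steps):
    """Online projected sub-gradient descent (illustrative sketch).

    subgradient(t, x) returns a sub-gradient of the round-t loss at x.
    The step size is an assumed AdaGrad-norm-style choice, not the
    learning rate derived in the paper.
    """
    x = project_to_ball(x0, radius)
    sq_norm_sum = 0.0
    iterates = [x]
    for t in range(steps):
        g = subgradient(t, x)
        sq_norm_sum += sum(v * v for v in g)
        if sq_norm_sum > 0.0:
            eta = radius / math.sqrt(sq_norm_sum)
            moved = [xi - eta * gi for xi, gi in zip(x, g)]
            x = project_to_ball(moved, radius)
        iterates.append(x)
    return iterates

def drifting_abs_subgradient(t, x):
    """Sub-gradient of f_t(x) = |x - c_t|, where the target c_t jumps once."""
    target = 0.5 if t < 50 else -0.5
    diff = x[0] - target
    return [0.0 if diff == 0 else math.copysign(1.0, diff)]

iterates = projected_subgradient_descent(drifting_abs_subgradient, [0.0], 1.0, 100)
```

The usage lines run the sketch on absolute-value losses whose minimizer changes once, which is the kind of changing comparator sequence that dynamic regret measures against.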
Gaussian Process Classification Bandits
Classification bandits are multi-armed bandit problems whose task is to
classify a given set of arms as either positive or negative, depending
on whether the rate of arms with expected reward of at least h is at
least w, for given thresholds h and w. We study a special classification
bandit problem in which arms correspond to points x in d-dimensional real space
with expected rewards f(x) which are generated according to a Gaussian process
prior. We develop a framework algorithm for the problem using various arm
selection policies and propose policies called FCB and FTSV. We show a smaller
sample complexity upper bound for FCB than for the existing level set
estimation algorithm, which must decide for every arm x whether f(x) is at
least h. Arm selection policies depending on an estimated
rate of arms with rewards of at least h are also proposed and shown to improve
empirical sample complexity. According to our experimental results, the
rate-estimation versions of FCB and FTSV, together with that of the popular
active learning policy that selects the point with the maximum variance,
outperform other policies for synthetic functions, and the version of FTSV is
also the best performer for our real-world dataset.
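The classification rule described in this abstract can be sketched as follows, assuming the expected rewards are known exactly. In the bandit problem itself they must be estimated from noisy samples; the function name and example values below are hypothetical.

```python
def classify_arm_set(expected_rewards, h, w):
    """Return True (positive class) if the rate of arms whose expected
    reward is at least h is at least w, and False (negative) otherwise."""
    if not expected_rewards:
        raise ValueError("need at least one arm")
    good = sum(1 for r in expected_rewards if r >= h)
    return good / len(expected_rewards) >= w

# Two of the four arms reach h = 0.5, so the rate is 0.5.
label = classify_arm_set([0.9, 0.2, 0.7, 0.4], h=0.5, w=0.5)
```

This also suggests why such a problem can need fewer samples than level set estimation: only the aggregate rate has to be resolved, not the status of every individual arm.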
A super-polynomial quantum-classical separation for density modelling
Density modelling is the task of learning an unknown probability density
function from samples, and is one of the central problems of unsupervised
machine learning. In this work, we show that there exists a density modelling
problem for which fault-tolerant quantum computers can offer a super-polynomial
advantage over classical learning algorithms, given standard cryptographic
assumptions. Along the way, we provide a variety of additional results and
insights, of potential interest for proving future distribution learning
separations between quantum and classical learning algorithms. Specifically, we
(a) provide an overview of the relationships between hardness results in
supervised learning and distribution learning, and (b) show that any weak
pseudo-random function can be used to construct a classically hard density
modelling problem. The latter result opens up the possibility of proving
quantum-classical separations for density modelling based on weaker assumptions
than those necessary for pseudo-random functions. Comment: 15 pages, one figur
On the adequacy of untuned warmup for adaptive optimization
Adaptive optimization algorithms such as Adam are widely used in deep
learning. The stability of such algorithms is often improved with a warmup
schedule for the learning rate. Motivated by the difficulty of choosing and
tuning warmup schedules, recent work proposes automatic variance rectification
of Adam's adaptive learning rate, claiming that this rectified approach
("RAdam") surpasses the vanilla Adam algorithm and reduces the need for
expensive tuning of Adam with warmup. In this work, we refute this analysis and
provide an alternative explanation for the necessity of warmup based on the
magnitude of the update term, which is of greater relevance to training
stability. We then provide some "rule-of-thumb" warmup schedules, and we
demonstrate that simple untuned warmup of Adam performs more-or-less
identically to RAdam in typical practical settings. We conclude by suggesting
that practitioners stick to linear warmup with Adam, with a sensible default
being linear warmup over training iterations. Comment: AAAI 202
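A minimal sketch of linear warmup applied to a scalar Adam update is given below. The warmup length is left as a free parameter (warmup_steps) because the text above does not state one; the function names, default hyperparameters, and the quadratic test objective are likewise assumptions for illustration only.

```python
import math

def linear_warmup_factor(step, warmup_steps):
    """Multiplier that ramps linearly to 1 over the first warmup_steps
    steps (step is 1-indexed) and stays at 1 afterwards."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def adam_with_linear_warmup(grad_fn, x0, base_lr=1e-3, beta1=0.9,
                            beta2=0.999, eps=1e-8, warmup_steps=100,
                            steps=1000):
    """Scalar Adam with bias correction, with the learning rate scaled
    by the linear warmup factor at every step."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        lr = base_lr * linear_warmup_factor(t, warmup_steps)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.
x_final = adam_with_linear_warmup(lambda x: 2.0 * x, 1.0,
                                  base_lr=0.01, steps=2000)
```

The warmup only scales the step size, so it limits the magnitude of the earliest updates without changing Adam's moment estimates.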