Online Learning in Case of Unbounded Losses Using the Follow Perturbed Leader Algorithm
In this paper the sequential prediction problem with expert advice is
considered for the case where the losses experts suffer at each step cannot be
bounded in advance. We present a modification of the follow-the-perturbed-leader
algorithm of Kalai and Vempala in which the weights depend on the past losses of
the experts. New notions of the volume and the scaled fluctuation of a game are
introduced. We present a probabilistic algorithm that is protected against
unrestrictedly large one-step losses. This algorithm has optimal performance in
the case where the scaled fluctuations of the one-step losses of the experts in
the pool tend to zero.
Comment: 31 pages, 3 figures
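The protocol this abstract builds on can be sketched in a few lines. Below is a minimal, generic follow-the-perturbed-leader loop for the full-information expert setting (my own illustration with exponential perturbations, not the paper's adaptive-weight variant): each round, every expert's cumulative loss is perturbed by independent noise and the current perturbed leader is played.

```python
import random

def fpl_round(cum_losses, eta, rng):
    """One FPL step: subtract an independent Exp(eta) perturbation from each
    expert's cumulative loss and follow the resulting (perturbed) leader."""
    perturbed = [L - rng.expovariate(eta) for L in cum_losses]
    return min(range(len(cum_losses)), key=lambda i: perturbed[i])

def run_fpl(loss_rows, eta=0.5, seed=0):
    """Full-information FPL over a sequence of loss vectors (one per round)."""
    rng = random.Random(seed)
    n = len(loss_rows[0])
    cum = [0.0] * n
    total = 0.0
    for row in loss_rows:
        i = fpl_round(cum, eta, rng)
        total += row[i]                          # loss suffered this round
        cum = [c + l for c, l in zip(cum, row)]  # all losses are revealed
    return total, cum
```

With bounded losses and a suitable eta this achieves sublinear regret; the unbounded-loss setting of the paper is exactly where this fixed-eta sketch breaks down and adaptive weighting is needed.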
First-order regret bounds for combinatorial semi-bandits
We consider the problem of online combinatorial optimization under
semi-bandit feedback, where a learner has to repeatedly pick actions from a
combinatorial decision set in order to minimize the total losses associated
with its decisions. After making each decision, the learner observes the losses
associated with its action, but not other losses. For this problem, there are
several learning algorithms that guarantee that the learner's expected regret
grows as $\sqrt{T}$ with the number of rounds $T$. In this
paper, we propose an algorithm that improves this scaling to
$\sqrt{L_T^*}$, where $L_T^*$ is the total loss of the best
action. Our algorithm is among the first to achieve such guarantees in a
partial-feedback scheme, and the first one to do so in a combinatorial setting.
Comment: To appear at COLT 2015
Analysis of Perturbation Techniques in Online Learning
The most commonly used regularization technique in machine learning is to directly add a penalty function to the optimization objective. For example, $\ell_2$ regularization is universally applied to a wide range of models including linear regression and neural networks. The alternative regularization technique, which has become essential in modern applications of machine learning, is implicit regularization by injecting random noise into the training data.
In fact, this idea of using random perturbations as a regularizer underlies one of the first algorithms for online learning, where a learner chooses actions iteratively on a data sequence that may be designed adversarially to thwart the learning process. One such classical algorithm is known as Follow The Perturbed Leader (FTPL).
This dissertation presents new interpretations of FTPL. In the first part, we show that FTPL is equivalent to playing the gradients of a stochastically smoothed potential function in the dual space. In the second part, we show that FTPL is the extension of a differentially private mechanism that has inherent stability guarantees. These perspectives lead to novel frameworks for FTPL regret analysis, which not only prove strong performance guarantees but also help characterize the optimal choice of noise distributions. Furthermore, they extend to the partial information setting where the learner observes only part of the input data.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/143968/1/chansool_1.pd
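The first-part claim, that FTPL plays the gradient of a stochastically smoothed potential, can be checked numerically in the classic special case of Gumbel perturbations over a finite action set: there the smoothed max-potential is logsumexp, whose gradient is the softmax distribution. The sketch below (an illustration under that Gumbel assumption, not the dissertation's general framework) compares FTPL's Monte Carlo argmax frequencies with the softmax probabilities.

```python
import math
import random

def softmax(g):
    """Gradient of logsumexp, computed stably."""
    m = max(g)
    e = [math.exp(x - m) for x in g]
    s = sum(e)
    return [x / s for x in e]

def ftpl_play_freq(gains, n_samples=200000, seed=0):
    """Monte Carlo estimate of FTPL's play distribution: add i.i.d. standard
    Gumbel noise to the cumulative gains and record argmax frequencies."""
    rng = random.Random(seed)
    k = len(gains)
    counts = [0] * k
    for _ in range(n_samples):
        # standard Gumbel sample via inverse CDF: -log(-log(U))
        z = [g - math.log(-math.log(rng.random())) for g in gains]
        counts[max(range(k), key=lambda i: z[i])] += 1
    return [c / n_samples for c in counts]
```

With Gumbel noise the match is exact in expectation; for other noise distributions the same gradient identity holds, but with a different smoothed potential in place of logsumexp.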
Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments
The nonstochastic multi-armed bandit problem, first studied by Auer, Cesa-Bianchi, Freund, and Schapire in 1995, is a game of repeatedly choosing one decision from a set of decisions (“experts”), under partial observation: in each round $t$, only the cost of the decision played is observable. A regret minimization algorithm plays this game while achieving sublinear regret relative to each decision. It is known that an adversary controlling the costs of the decisions can force on the player a regret growing as $t^{1/2}$ in the time $t$. In this work, we propose the first algorithm for a countably infinite set of decisions that achieves a regret upper bounded by $O(t^{1/2+\varepsilon})$, i.e. arbitrarily close to the optimal order. To this aim, we build on the “follow the perturbed leader” principle, which dates back to work by Hannan in 1957. Our results hold against an adaptive adversary, for both the expected and high-probability regret of the learner w.r.t. each decision. In the second part of the paper, we consider reactive problem settings, that is, situations where the learner’s decisions impact the future behaviour of the adversary, and a strong strategy can draw benefit from well-chosen past actions. We present a variant of our regret minimization algorithm which still has regret of order at most $t^{1/2+\varepsilon}$ relative to such strong strategies, and even sublinear regret not exceeding $O(t^{4/5})$ w.r.t. the hypothetical (without external interference) performance of a strong strategy. We show how to combine the regret minimizer with a universal class of experts, given by the countable set of programs on some fixed universal Turing machine. This defines a universal learner with sublinear regret relative to any computable strategy.
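A standard way to make a perturbed-leader method work under the bandit feedback described above is an exploration/importance-weighting device. The sketch below is a simplified finite-arm illustration in that spirit, not the paper's countable-decision algorithm: costs are estimated only on uniform exploration rounds and importance-weighted, so the cumulative estimates stay unbiased even though only the played arm's cost is revealed.

```python
import random

def bandit_fpl(cost_rows, eta=0.3, gamma=0.1, seed=0):
    """Bandit FPL sketch: with probability gamma explore a uniformly random
    arm and importance-weight its observed cost by n/gamma; otherwise follow
    the leader of the perturbed cumulative cost estimates. Only the played
    arm's cost is ever observed."""
    rng = random.Random(seed)
    n = len(cost_rows[0])
    est = [0.0] * n                     # cumulative unbiased cost estimates
    total = 0.0
    for row in cost_rows:
        if rng.random() < gamma:        # exploration round
            i = rng.randrange(n)
            est[i] += row[i] * n / gamma
        else:                           # exploitation: perturbed leader
            pert = [est[j] - rng.expovariate(eta) for j in range(n)]
            i = min(range(n), key=lambda j: pert[j])
        total += row[i]
    return total
```

Extending this to countably many decisions, as the paper does, requires in addition a schedule that gradually activates new experts and adjusts the perturbation and exploration rates over time.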
On Adaptivity in Information-constrained Online Learning
We study how to adapt to smoothly-varying ('easy') environments in well-known
online learning problems where acquiring information is expensive. For the
problem of label efficient prediction, which is a budgeted version of
prediction with expert advice, we present an online algorithm whose regret
depends optimally on the number of labels allowed and on $Q^*_T$ (the quadratic
variation of the losses of the best action in hindsight), along with a
parameter-free counterpart whose regret depends optimally on $Q_T$ (the quadratic
variation of the losses of all the actions). These quantities can be
significantly smaller than $T$ (the total time horizon), yielding an
improvement over existing, variation-independent results for the problem. We
then extend our analysis to handle label efficient prediction with bandit
feedback, i.e., label efficient bandits. Our work builds upon the framework of
optimistic online mirror descent, and leverages second order corrections along
with a carefully designed hybrid regularizer that encodes the constrained
information structure of the problem. We then consider revealing action-partial
monitoring games -- a version of label efficient prediction with additive
information costs, which in general are known to lie in the \textit{hard} class
of games having minimax regret of order $\Theta(T^{2/3})$. We provide a
strategy with an $\mathcal{O}((Q^*_T T)^{1/3})$ bound for revealing action
games, along with one with an $\mathcal{O}((Q_T T)^{1/3})$ bound for the
full class of hard partial monitoring games, both being strict improvements
over current bounds.
Comment: 34th AAAI Conference on Artificial Intelligence (AAAI 2020). Short
version at 11th Optimization for Machine Learning workshop (OPT 2019)
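The label efficient setting above can be sketched concretely: the learner runs exponential weights but asks for the round's loss vector only with a fixed probability (spending one label), importance-weighting the observed losses so the update stays unbiased in expectation. This is a minimal illustration of the classical, variation-independent label efficient forecaster, not the adaptive algorithm of the paper.

```python
import math
import random

def label_efficient_hedge(loss_rows, eps=0.2, eta=0.1, seed=0):
    """Label efficient exponential weights: predict with an expert sampled
    from the current weights; with probability eps query the round's losses
    and apply an importance-weighted (1/eps) multiplicative update."""
    rng = random.Random(seed)
    n = len(loss_rows[0])
    w = [1.0] * n
    total, labels = 0.0, 0
    for row in loss_rows:
        s = sum(w)
        probs = [x / s for x in w]
        i, r, acc = n - 1, rng.random(), 0.0   # sample expert ~ probs
        for j, p in enumerate(probs):
            acc += p
            if r < acc:
                i = j
                break
        total += row[i]
        if rng.random() < eps:                 # label query round
            labels += 1
            w = [x * math.exp(-eta * l / eps) for x, l in zip(w, row)]
    return total, labels
```

The expected number of labels used is eps times the horizon; the adaptivity studied in the paper replaces the worst-case tuning of eps and eta with quantities like the quadratic variation of the losses.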