An efficient algorithm for learning with semi-bandit feedback
We consider the problem of online combinatorial optimization under
semi-bandit feedback. The goal of the learner is to sequentially select its
actions from a combinatorial decision set so as to minimize its cumulative
loss. We propose a learning algorithm for this problem based on combining the
Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss
estimation procedure called Geometric Resampling (GR). Contrary to previous
solutions, the resulting algorithm can be efficiently implemented for any
decision set where efficient offline combinatorial optimization is possible at
all. Assuming that the elements of the decision set can be described with
d-dimensional binary vectors with at most m non-zero entries, we show that the
expected regret of our algorithm after T rounds is $O(m\sqrt{dT\log d})$. As a
side result, we also improve the best known regret bounds for FPL in the full
information setting to $O(m^{3/2}\sqrt{T\log d})$, gaining a factor of $\sqrt{d/m}$
over previous bounds for this algorithm.
Comment: submitted to ALT 2013
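The Geometric Resampling idea is easy to state: the importance weights $1/p_{t,i}$ needed for unbiased loss estimates have no closed form under FPL, so GR estimates them by redrawing the perturbed leader until component i reappears; the number of redraws is geometric with mean $1/p_{t,i}$. Below is a minimal sketch for the simplest decision set, choosing m of d arms, where the offline oracle is a sort; the learning rate eta, the resampling cap M, and the helper names are assumptions of this sketch, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def fpl_leader(loss_est, eta, m, rng):
    """Offline oracle call: pick the m coordinates minimizing the
    exponentially perturbed cumulative loss estimates."""
    perturbed = loss_est - rng.exponential(1.0 / eta, size=loss_est.shape)
    return np.argsort(perturbed)[:m]

def fpl_gr(losses, m, eta, M):
    """FPL + Geometric Resampling sketch: `losses` is the T x d loss matrix,
    revealed semi-bandit style (only chosen coordinates are observed);
    M caps the resampling loop for bounded per-round running time."""
    T, d = losses.shape
    loss_est = np.zeros(d)                 # cumulative loss estimates
    total = 0.0
    for t in range(T):
        action = fpl_leader(loss_est, eta, m, rng)
        total += losses[t, action].sum()   # suffer the chosen coordinates
        updates = {}
        for i in action:
            # Geometric Resampling: count fresh perturbed-leader draws until
            # component i reappears; the count has expectation 1 / p_{t,i}.
            K = 1
            while K < M and i not in fpl_leader(loss_est, eta, m, rng):
                K += 1
            updates[i] = K * losses[t, i]  # importance-weighted estimate
        for i, u in updates.items():
            loss_est[i] += u               # update only after all resampling
    return total
```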
Online Learning with Switching Costs and Other Adaptive Adversaries
We study the power of different types of adaptive (nonoblivious) adversaries
in the setting of prediction with expert advice, under both full-information
and bandit feedback. We measure the player's performance using a new notion of
regret, also known as policy regret, which better captures the adversary's
adaptiveness to the player's behavior. In a setting where losses are allowed to
drift, we characterize, in a nearly complete manner, the power of adaptive
adversaries with bounded memories and switching costs. In particular, we show
that with switching costs, the attainable rate with bandit feedback is
$\widetilde{\Theta}(T^{2/3})$. Interestingly, this rate is significantly worse
than the $\Theta(\sqrt{T})$ rate attainable with switching costs in the
full-information case. Via a novel reduction from experts to bandits, we also
show that a bounded memory adversary can force $\widetilde{\Theta}(T^{2/3})$
regret even in the full information case, proving that switching costs are
easier to control than bounded memory adversaries. Our lower bounds rely on a
new stochastic adversary strategy that generates loss processes with strong
dependencies.
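The matching $\widetilde{\Theta}(T^{2/3})$ upper bound with switching costs comes from the standard batching reduction known from the broader literature (not a contribution of this paper): run a bandit algorithm such as EXP3 over blocks of length roughly $T^{1/3}$, holding the arm fixed within each block, so at most $T^{2/3}$ switches are ever paid. A minimal sketch, with eta, tau, and the loss-matrix layout as assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def exp3_batched(losses, switch_cost, tau, eta):
    """Batching sketch: EXP3 over blocks of length tau, arm fixed per block,
    so at most T/tau switches occur; tau ~ T^{1/3} balances the two costs."""
    T, N = losses.shape
    est = np.zeros(N)                      # cumulative loss estimates
    total, prev = 0.0, None
    for start in range(0, T, tau):
        p = np.exp(-eta * (est - est.min()))
        p /= p.sum()                       # exponential-weights distribution
        arm = rng.choice(N, p=p)
        block = losses[start:start + tau, arm]
        total += block.sum()
        if prev is not None and arm != prev:
            total += switch_cost           # pay only on an actual switch
        prev = arm
        est[arm] += block.mean() / p[arm]  # importance-weighted block loss
    return total
```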
Fighting Bandits with a New Kind of Smoothness
We define a novel family of algorithms for the adversarial multi-armed bandit
problem, and provide a simple analysis technique based on convex smoothing. We
prove two main results. First, we show that regularization via the
Tsallis entropy, which includes EXP3 as a special case, achieves the
$\Theta(\sqrt{TN})$ minimax regret. Second, we show that a wide class of
perturbation methods achieve a near-optimal regret as low as
$O(\sqrt{TN \log N})$ if the perturbation distribution has a bounded hazard
rate. For example,
the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this
key property.
Comment: In Proceedings of NIPS, 2015
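The bounded-hazard-rate condition is straightforward to check numerically. The sketch below (an illustration, not the paper's code) evaluates the hazard rate $h(x) = f(x)/(1 - F(x))$ of the standard Gumbel distribution, which stays below 1 everywhere and approaches it as x grows:

```python
import numpy as np

def gumbel_hazard(x):
    """Hazard rate h(x) = f(x) / (1 - F(x)) of the standard Gumbel
    distribution; its boundedness is the key property named above."""
    f = np.exp(-x - np.exp(-x))     # Gumbel pdf
    F = np.exp(-np.exp(-x))         # Gumbel cdf
    return f / (1.0 - F)

xs = np.linspace(-5.0, 10.0, 1000)
print(gumbel_hazard(xs).max())      # stays below 1, the hazard-rate bound
```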
Minimax Policies for Combinatorial Prediction Games
We address the online linear optimization problem when the actions of the
forecaster are represented by binary vectors. Our goal is to understand the
magnitude of the minimax regret for the worst possible set of actions. We study
the problem under three different assumptions for the feedback: full
information, and the partial information models of the so-called "semi-bandit",
and "bandit" problems. We consider both -, and -type of
restrictions for the losses assigned by the adversary.
We formulate a general strategy using Bregman projections on top of a
potential-based gradient descent, which generalizes the ones studied in the
series of papers Gyorgy et al. (2007), Dani et al. (2008), Abernethy et al.
(2008), Cesa-Bianchi and Lugosi (2009), Helmbold and Warmuth (2009), Koolen et
al. (2010), Uchiya et al. (2010), Kale et al. (2010) and Audibert and Bubeck
(2010). We provide simple proofs that recover most of the previous results. We
propose new upper bounds for the semi-bandit game. Moreover we derive lower
bounds for all three feedback assumptions. With the only exception of the
bandit game, the upper and lower bounds are tight, up to a constant factor.
Finally, we answer a question asked by Koolen et al. (2010) by showing that the
exponentially weighted average forecaster is suboptimal against
$L_\infty$ adversaries.
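On the probability simplex with the entropic potential, the general Bregman-projection strategy reduces to the familiar exponential-weights update, since the KL projection onto the simplex is a renormalization. A minimal full-information sketch of that special case (a simplification; the paper works over the convex hull of the binary action vectors, and the semi-bandit and bandit variants replace the loss vector with an importance-weighted estimate):

```python
import numpy as np

def osmd_entropy(losses, eta):
    """Online mirror descent with the entropic potential on the simplex:
    a dual gradient step followed by a Bregman (KL) projection, which on
    the simplex is just renormalization."""
    T, d = losses.shape
    x = np.full(d, 1.0 / d)                # uniform starting point
    total = 0.0
    for t in range(T):
        total += x @ losses[t]             # full-information linear loss
        y = x * np.exp(-eta * losses[t])   # step in the dual coordinates
        x = y / y.sum()                    # KL projection onto the simplex
    return total
```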