434 research outputs found
An efficient algorithm for learning with semi-bandit feedback
We consider the problem of online combinatorial optimization under
semi-bandit feedback. The goal of the learner is to sequentially select its
actions from a combinatorial decision set so as to minimize its cumulative
loss. We propose a learning algorithm for this problem based on combining the
Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss
estimation procedure called Geometric Resampling (GR). Contrary to previous
solutions, the resulting algorithm can be efficiently implemented for any
decision set where efficient offline combinatorial optimization is possible at
all. Assuming that the elements of the decision set can be described with
d-dimensional binary vectors with at most m non-zero entries, we show that the
expected regret of our algorithm after T rounds is O(m sqrt(dT log d)). As a
side result, we also improve the best known regret bounds for FPL in the full
information setting to O(m^(3/2) sqrt(T log d)), gaining a factor of sqrt(d/m)
over previous bounds for this algorithm.Comment: submitted to ALT 201
On the Complexity of Bandit Linear Optimization
We study the attainable regret for online linear optimization problems with
bandit feedback, where unlike the full-information setting, the player can only
observe its own loss rather than the full loss vector. We show that the price
of bandit information in this setting can be as large as , disproving the
well-known conjecture that the regret for bandit linear optimization is at most
times the full-information regret. Surprisingly, this is shown using
"trivial" modifications of standard domains, which have no effect in the
full-information setting. This and other results we present highlight some
interesting differences between full-information and bandit learning, which
were not considered in previous literature
Fighting Bandits with a New Kind of Smoothness
We define a novel family of algorithms for the adversarial multi-armed bandit
problem, and provide a simple analysis technique based on convex smoothing. We
prove two main results. First, we show that regularization via the
\emph{Tsallis entropy}, which includes EXP3 as a special case, achieves the
minimax regret. Second, we show that a wide class of
perturbation methods achieve a near-optimal regret as low as if the perturbation distribution has a bounded hazard rate. For example,
the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this
key property.Comment: In Proceedings of NIPS, 201
- …