An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes.
Comment: Entrop
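The abstract above does not state the update rule, so the following is only a minimal sketch of the general idea: softmax (Boltzmann) exploration on a Bernoulli bandit, with a temperature parameter that is cooled over episodes. The function names, the `tau0 / t` schedule, and the Bernoulli reward model are assumptions made for illustration, not the authors' value-of-information method.

```python
import math
import random


def softmax_probs(values, temperature):
    """Return softmax probabilities over `values` at the given temperature."""
    m = max(values)  # subtract the max so the exponent is never positive
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def run_annealed_softmax(arm_means, episodes, tau0=1.0, seed=0):
    """Play a Bernoulli bandit with softmax exploration and a cooled temperature.

    `arm_means` are the true success probabilities (unknown to the learner).
    The schedule `tau0 / t` is a hypothetical choice, not taken from the paper.
    """
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n          # number of pulls per arm
    estimates = [0.0] * n     # running mean reward per arm
    best = max(arm_means)
    regret = 0.0              # cumulative expected regret
    for t in range(1, episodes + 1):
        temperature = tau0 / t  # high early (explore), low later (exploit)
        probs = softmax_probs(estimates, temperature)
        arm = rng.choices(range(n), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        regret += best - arm_means[arm]
    return estimates, counts, regret


if __name__ == "__main__":
    est, cnt, reg = run_annealed_softmax([0.2, 0.5, 0.8], episodes=2000)
    print("estimates:", est)
    print("pull counts:", cnt)
    print("cumulative regret:", reg)
```

A high temperature spreads probability nearly uniformly across arms, while a low temperature concentrates it on the arm with the best current estimate. This is the exploration and exploitation trade-off the abstract attributes to its information parameter.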
Strategies for prediction under imperfect monitoring
We propose simple randomized strategies for sequential prediction under
imperfect monitoring, that is, when the forecaster does not have access to the
past outcomes but rather to a feedback signal. The proposed strategies are
consistent in the sense that they achieve, asymptotically, the best possible
average reward. It was Rustichini (1999) who first proved the existence of such
consistent predictors. The forecasters presented here offer the first
constructive proof of consistency. Moreover, the proposed algorithms are
computationally efficient. We also establish upper bounds for the rates of
convergence. In the case of deterministic feedback, these rates are optimal up
to logarithmic terms.
Comment: Journal version of a COLT conference pape
Spectrum Bandit Optimization
We consider the problem of allocating radio channels to links in a wireless
network. Links interact through interference, modelled as a conflict graph
(i.e., two interfering links cannot be simultaneously active on the same
channel). We aim at identifying the channel allocation maximizing the total
network throughput over a finite time horizon. Should we know the average radio
conditions on each channel and on each link, an optimal allocation would be
obtained by solving an Integer Linear Program (ILP). When radio conditions are
unknown a priori, we look for a sequential channel allocation policy that
converges to the optimal allocation while minimizing the throughput loss, or
regret, incurred along the way by exploring sub-optimal allocations. We
formulate this problem as a generic linear bandit problem, and analyze it first
in a stochastic setting where radio conditions are driven by a stationary
stochastic process, and then in an adversarial setting where radio conditions
can evolve arbitrarily. We provide new algorithms in both settings and derive
upper bounds on their regrets.
Comment: 21 page
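To illustrate the known-conditions case described above, here is a minimal sketch that finds the throughput-maximizing allocation on a tiny instance by exhaustive search, standing in for an ILP solver. The function name, the data layout (`mean_rate[link][channel]`), and the assumption that each link uses at most one channel are illustrative choices, not details from the paper.

```python
from itertools import product


def best_allocation(mean_rate, conflicts, num_channels):
    """Exhaustively search channel allocations under conflict-graph constraints.

    mean_rate[l][c]: assumed known average throughput of link l on channel c.
    conflicts: iterable of (i, j) link pairs that interfere with each other.
    Each link gets one channel or None (inactive); this is an assumption.
    """
    num_links = len(mean_rate)
    options = [None] + list(range(num_channels))
    best_value, best_assign = float("-inf"), None
    for assign in product(options, repeat=num_links):
        # Two interfering links may not be active on the same channel.
        feasible = all(
            assign[i] is None or assign[i] != assign[j]
            for i, j in conflicts
        )
        if not feasible:
            continue
        value = sum(
            mean_rate[link][ch]
            for link, ch in enumerate(assign)
            if ch is not None
        )
        if value > best_value:
            best_value, best_assign = value, assign
    return best_assign, best_value


if __name__ == "__main__":
    # Three links in a line: 0 -- 1 -- 2, two channels.
    rates = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6]]
    assign, value = best_allocation(rates, [(0, 1), (1, 2)], num_channels=2)
    print("allocation:", assign, "total throughput:", value)
```

The search is exponential in the number of links, so it only suits toy instances. When the rates are unknown, a bandit policy has to estimate `mean_rate` from observed throughput while choosing among these feasible allocations.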