3,111 research outputs found
An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes.Comment: Entrop
- …