Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals
We study the stochastic Budgeted Multi-Armed Bandit (MAB) problem, where a
player chooses from arms with unknown expected rewards and costs. The goal
is to maximize the total reward under a budget constraint. A player thus seeks
to choose the arm with the highest reward-cost ratio as often as possible.
Current state-of-the-art policies for this problem have several issues, which
we illustrate. To overcome them, we propose a new upper confidence bound (UCB)
sampling policy, ω-UCB, that uses asymmetric confidence intervals. These
intervals scale with the distance between the sample mean and the bounds of a
random variable, yielding a tighter and more accurate estimate of the
reward-cost ratio than competing approaches. We show that our approach has
logarithmic regret and consistently outperforms existing policies in synthetic
and real settings.
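To make the idea concrete, below is a minimal Python sketch of a budgeted UCB loop with asymmetric intervals, assuming rewards and costs in [0, 1] with costs bounded away from zero. The interval shape in `asymmetric_interval`, the constant `alpha`, and the logarithmic confidence term are illustrative assumptions, not the paper's ω-UCB formula.

```python
import math

def asymmetric_interval(mean, pulls, t, lo=0.0, hi=1.0, alpha=2.0):
    """Confidence interval whose two half-widths scale with the distance
    from the sample mean to the support bounds: the upward deviation is
    damped near `hi`, the downward deviation near `lo`. Illustrative
    stand-in; the paper derives its own interval."""
    base = math.sqrt(alpha * math.log(t + 1) / pulls)
    lower = max(lo, mean - (mean - lo) * base)
    upper = min(hi, mean + (hi - mean) * base)
    return lower, upper

def budgeted_ucb(arms, budget):
    """Budgeted bandit loop: repeatedly play the arm with the largest
    optimistic reward-cost ratio (upper reward bound over lower cost
    bound) until the budget runs out. `arms` is a list of functions,
    each returning a (reward, cost) sample in [0, 1] x (0, 1]."""
    k = len(arms)
    pulls = [0] * k
    r_mean, c_mean = [0.0] * k, [0.0] * k
    total, t = 0.0, 0
    while budget > 0:
        t += 1
        if t <= k:                      # initialization: play each arm once
            a = t - 1
        else:
            def ratio(i):
                _, r_up = asymmetric_interval(r_mean[i], pulls[i], t)
                c_lo, _ = asymmetric_interval(c_mean[i], pulls[i], t)
                return r_up / max(c_lo, 1e-9)   # guard against zero cost bound
            a = max(range(k), key=ratio)
        reward, cost = arms[a]()
        budget -= cost
        pulls[a] += 1
        r_mean[a] += (reward - r_mean[a]) / pulls[a]
        c_mean[a] += (cost - c_mean[a]) / pulls[a]
        total += reward
    return total
```

The index divides an optimistic reward bound by a pessimistic cost bound, so an arm whose sample mean sits close to a support bound gets a correspondingly tighter interval on that side, which is the intuition behind the asymmetry.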
Finding the bandit in a graph: Sequential search-and-stop
We consider the problem where an agent wants to find a hidden object that is
randomly located in some vertex of a directed acyclic graph (DAG) according to
a fixed but possibly unknown distribution. The agent can only examine vertices
whose in-neighbors have already been examined. In this paper, we address a
learning setting where we allow the agent to stop before having found the
object and restart searching on a new independent instance of the same problem.
Our goal is to maximize the total number of hidden objects found given a time
budget. The agent can thus skip an instance after realizing that it would spend
too much time on it. Our contributions are both to the search theory and
multi-armed bandits. If the distribution is known, we provide a quasi-optimal
and efficient stationary strategy. If the distribution is unknown, we
additionally show how to sequentially approximate it and, at the same time, act
near-optimally in order to collect as many hidden objects as possible.

Comment: In International Conference on Artificial Intelligence and Statistics (AISTATS 2019), April 2019, Naha, Okinawa, Japan.
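As a rough illustration of the search-and-stop interface (not the paper's quasi-optimal stationary strategy), the sketch below greedily examines the accessible vertex with the highest hiding probability under assumed unit examination costs, and abandons the instance once the renormalized probability of the best accessible vertex drops below a caller-chosen `threshold`. The function name, the greedy rule, and the stopping rule are all assumptions for illustration.

```python
import random

def search_and_stop(in_neighbors, prob, threshold, rng=random):
    """One instance of search-and-stop on a DAG. `in_neighbors` maps
    vertex -> list of in-neighbors; `prob` maps vertex -> hiding
    probability (summing to 1). A vertex is accessible only once all
    its in-neighbors have been examined. Returns (found, cost)."""
    hidden = rng.choices(list(prob), weights=list(prob.values()))[0]
    examined = set()
    cost = 0
    while True:
        accessible = [v for v in in_neighbors
                      if v not in examined
                      and all(u in examined for u in in_neighbors[v])]
        if not accessible:
            return False, cost
        remaining = sum(p for v, p in prob.items() if v not in examined)
        best = max(accessible, key=lambda v: prob[v])
        if remaining <= 0 or prob[best] / remaining < threshold:
            return False, cost          # stop early and skip this instance
        examined.add(best)
        cost += 1
        if best == hidden:
            return True, cost

# Example: diamond-shaped DAG with a known hiding distribution.
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
prob = {"a": 0.1, "b": 0.3, "c": 0.4, "d": 0.2}
found, spent = search_and_stop(dag, prob, threshold=0.05)
```

In the unknown-distribution setting described above, `prob` would be replaced by a running estimate that is updated after each instance; that learning layer is omitted here.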
An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes.

Comment: Entropy.
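In the spirit of that description, here is a minimal sketch of soft-max (Gibbs) exploration with an annealed inverse temperature; the value-of-information criterion itself is not reproduced, and the function name, `beta0`, and the logarithmic cooling schedule are assumptions for illustration only.

```python
import math
import random

def annealed_softmax_bandit(arms, episodes, beta0=1.0, rng=random):
    """Actions are sampled in proportion to exp(beta * empirical mean),
    and beta grows over episodes, so the policy shifts gradually from
    exploration-dominant to exploitation-dominant behavior. `arms` are
    functions returning a reward in [0, 1]."""
    k = len(arms)
    pulls = [0] * k
    mean = [0.0] * k
    total = 0.0
    for t in range(1, episodes + 1):
        beta = beta0 * math.log(t + 1)  # assumed cooling: beta rises as t grows
        weights = [math.exp(beta * m) for m in mean]
        a = rng.choices(range(k), weights=weights)[0]
        r = arms[a]()
        pulls[a] += 1
        mean[a] += (r - mean[a]) / pulls[a]
        total += r
    return total
```

A small `beta` early on makes the action distribution nearly uniform (high policy information, exploration-dominant), while a large `beta` later concentrates it on the empirically best arm, mirroring the trade-off the abstract describes.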