An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes. Comment: Entrop
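The information parameter here acts like a temperature that is annealed during search. A minimal sketch using Boltzmann (softmax) exploration with a geometric cooling schedule; the softmax rule is a stand-in for illustration, not the paper's exact value-of-information criterion:

```python
import math
import random

def softmax_bandit(reward_fn, n_arms, n_episodes, t0=1.0, cooling=0.99):
    """Softmax exploration with an annealed temperature.

    High temperature -> exploration-dominant search; low temperature ->
    exploitation of current estimates, mirroring the role of the
    information parameter described in the abstract.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    temp = t0
    total = 0.0
    for _ in range(n_episodes):
        # softmax probabilities over empirical means (shifted for stability)
        mx = max(means)
        weights = [math.exp((m - mx) / temp) for m in means]
        r = random.random() * sum(weights)
        arm, acc = n_arms - 1, 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                arm = i
                break
        reward = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
        temp = max(temp * cooling, 1e-3)  # geometric cooling schedule
    return total, means
```

With fast enough cooling the policy concentrates on the empirically best arm while still sampling every arm early on.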
Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms
We consider stochastic multi-armed bandit problems where the expected reward
is a Lipschitz function of the arm, and where the set of arms is either
discrete or continuous. For discrete Lipschitz bandits, we derive asymptotic
problem specific lower bounds for the regret satisfied by any algorithm, and
propose OSLB and CKL-UCB, two algorithms that efficiently exploit the Lipschitz
structure of the problem. In fact, we prove that OSLB is asymptotically
optimal, as its asymptotic regret matches the lower bound. The regret analysis
of our algorithms relies on a new concentration inequality for weighted sums of
KL divergences between the empirical distributions of rewards and their true
distributions. For continuous Lipschitz bandits, we propose to first discretize
the action space, and then apply OSLB or CKL-UCB, algorithms that provably
exploit the structure efficiently. This approach is shown, through numerical
experiments, to significantly outperform existing algorithms that directly deal
with the continuous set of arms. Finally the results and algorithms are
extended to contextual bandits with similarities. Comment: COLT 201
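The continuous-arm recipe is: discretize the action space, then run an index policy on the grid. A minimal sketch of that step, with the standard UCB1 index standing in for OSLB/CKL-UCB, whose exact indices are not given in the abstract:

```python
import math

def ucb_on_grid(reward_fn, n_points, horizon):
    """Discretize [0, 1] into a uniform grid and run UCB1 on the grid arms.

    For a Lipschitz mean-reward function, a fine enough grid contains an
    arm whose mean is close to the continuous optimum.
    """
    arms = [k / (n_points - 1) for k in range(n_points)]
    counts = [0] * n_points
    means = [0.0] * n_points
    for t in range(1, horizon + 1):
        if t <= n_points:  # pull each grid arm once first
            i = t - 1
        else:
            i = max(range(n_points),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = reward_fn(arms[i])
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    best = max(range(n_points), key=lambda j: means[j])
    return arms[best]
```

The grid resolution trades discretization error against the number of arms; the paper's algorithms exploit the Lipschitz structure further than plain UCB1 does.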
Stacked Thompson Bandits
We introduce Stacked Thompson Bandits (STB) for efficiently generating plans
that are likely to satisfy a given bounded temporal logic requirement. STB
uses simulation to evaluate plans, and takes a Bayesian approach to using the
resulting information to guide its search. In particular, we show that
stacking multi-armed bandits and using Thompson sampling to guide the action
selection process for each bandit enables STB to generate plans that satisfy
requirements with high probability while searching only a fraction of the
search space. Comment: Accepted at SEsCPS @ ICSE 201
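The building block being stacked is a Thompson-sampling bandit. A minimal sketch of Thompson sampling for a single Bernoulli bandit with Beta(1, 1) priors; STB stacks one such bandit per plan step, a detail not reproduced here:

```python
import random

def thompson_bernoulli(reward_fn, n_arms, horizon, rng=random):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors.

    Each round: sample a mean for every arm from its posterior, play the
    arm with the largest sample, then update that arm's posterior.
    """
    successes = [1] * n_arms  # Beta alpha parameters
    failures = [1] * n_arms   # Beta beta parameters
    for _ in range(horizon):
        samples = [rng.betavariate(successes[i], failures[i])
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        if reward_fn(arm):
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```

Posterior sampling naturally shifts pulls toward arms that look promising, which is what lets the stacked search visit only a fraction of the plan space.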
Batched bandit problems
Motivated by practical applications, chiefly clinical trials, we study the
regret achievable for stochastic bandits under the constraint that the employed
policy must split trials into a small number of batches. We propose a simple
policy, and show that a very small number of batches suffices to achieve
near-minimax-optimal regret. As a byproduct, we derive optimal policies with low
switching cost for stochastic bandits. Comment: Published at
http://dx.doi.org/10.1214/15-AOS1381 in the Annals of Statistics
(http://www.imstat.org/aos/) by the Institute of Mathematical Statistics
(http://www.imstat.org)
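The batching constraint means the policy may only update its decisions at batch boundaries. A minimal sketch of a batched elimination policy; the confidence-gap test and batch sizes below are illustrative, not the paper's minimax-optimal batch grid:

```python
import math

def batched_elimination(reward_fn, n_arms, batch_sizes):
    """Batched bandit sketch: pulls within a batch are fixed in advance,
    and arm eliminations happen only at the end of each batch.

    This also yields low switching cost: the played-arm sequence changes
    only at the (few) batch boundaries.
    """
    active = list(range(n_arms))
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for size in batch_sizes:
        # split the batch budget evenly over the surviving arms
        per_arm = max(size // len(active), 1)
        for arm in active:
            for _ in range(per_arm):
                r = reward_fn(arm)
                counts[arm] += 1
                means[arm] += (r - means[arm]) / counts[arm]
        # eliminate arms whose mean is clearly below the leader's
        best = max(means[a] for a in active)
        t = sum(counts)
        active = [a for a in active
                  if means[a] + 2 * math.sqrt(math.log(t) / counts[a]) >= best]
    return active, means
```

In a clinical-trial reading, each batch is one wave of patients; treatment arms are dropped only between waves.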
Decentralized Exploration in Multi-Armed Bandits
We consider the decentralized exploration problem: a set of players
collaborate to identify the best arm by asynchronously interacting with the
same stochastic environment. The objective is to ensure privacy in the best arm
identification problem among asynchronous, collaborative, and thrifty
players. In the context of a digital service, we advocate that this
decentralized approach allows a good balance between the interests of users and
those of service providers: the providers optimize their services, while
protecting the privacy of the users and saving resources. We define the privacy
level as the amount of information an adversary could infer by intercepting the
messages concerning a single user. We provide a generic algorithm Decentralized
Elimination, which uses any best arm identification algorithm as a subroutine.
We prove that this algorithm ensures privacy, with a low communication cost,
and that, relative to the lower bound for the best arm identification
problem, its sample complexity suffers a penalty depending on the inverse
of the probability of the most frequent players. Then, thanks to the genericity
of the approach, we extend the proposed algorithm to non-stationary
bandits. Finally, experiments illustrate and complete the analysis.
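A toy sketch of the plug-in structure, with successive elimination as the best arm identification subroutine. The paper's actual message protocol and privacy mechanism are not reproduced; the sketch only illustrates that players keep raw samples local and share nothing but arm eliminations:

```python
import math

def decentralized_elimination_sketch(reward_fns, n_arms,
                                     rounds_per_phase, n_phases):
    """Each player samples arms locally; only eliminations are shared.

    reward_fns: one reward function per player, all drawing from the same
    environment. Players never exchange raw rewards, only decisions to
    drop an arm from the shared active set.
    """
    active = set(range(n_arms))
    # per-player local statistics (never broadcast)
    counts = [[0] * n_arms for _ in reward_fns]
    means = [[0.0] * n_arms for _ in reward_fns]
    for _ in range(n_phases):
        for p, f in enumerate(reward_fns):
            for arm in list(active):
                for _ in range(rounds_per_phase):
                    r = f(arm)
                    counts[p][arm] += 1
                    means[p][arm] += (r - means[p][arm]) / counts[p][arm]
            # each player proposes eliminations from its own estimates only
            best = max(means[p][a] for a in active)
            for a in list(active):
                radius = 2 * math.sqrt(math.log(counts[p][a] + 1)
                                       / counts[p][a])
                if len(active) > 1 and means[p][a] + radius < best:
                    active.discard(a)  # broadcast: "arm a is out"
    return active
```

Because intercepted messages carry only eliminations, an adversary learns far less about any single user's samples than it would from raw rewards, which is the balance the abstract advocates.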