Uncertainty and Exploration in a Restless Bandit Problem
Decision making in noisy and changing environments requires a fine balance between exploiting knowledge about good courses of action and exploring the environment in order to improve upon this knowledge. We present an experiment on a restless bandit task in which participants made repeated choices between options whose average rewards changed over time. Comparing a number of computational models of participants' behavior in this task, we find evidence that a substantial number of participants balanced exploration and exploitation by considering the probability that an option offers the maximum reward out of all the available options.
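For a flavor of what such a probability-of-maximum choice rule looks like, here is a minimal Python sketch: arm means drift as Gaussian random walks (the restless bandit), a Kalman filter tracks each arm's posterior, and the agent chooses in proportion to the estimated probability that each arm currently offers the maximum reward. All parameters and the Kalman tracking model are illustrative assumptions, not values or model details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_trials = 4, 200
innov_sd, obs_sd = 2.0, 4.0        # drift and observation noise (assumed values)

true_means = np.zeros(n_arms)      # arm means follow Gaussian random walks
post_mean = np.zeros(n_arms)       # Kalman posterior over each arm's mean
post_var = np.full(n_arms, 100.0)  # diffuse prior

for t in range(n_trials):
    # Estimate, by Monte Carlo, the probability that each arm currently
    # offers the maximum reward, then choose in proportion to it.
    draws = rng.normal(post_mean, np.sqrt(post_var), size=(1000, n_arms))
    p_max = np.bincount(draws.argmax(axis=1), minlength=n_arms) / 1000.0
    arm = rng.choice(n_arms, p=p_max)

    reward = rng.normal(true_means[arm], obs_sd)

    # Kalman update for the chosen arm; unchosen arms only accumulate
    # uncertainty, because the underlying means keep drifting.
    prior_var = post_var + innov_sd**2
    gain = prior_var[arm] / (prior_var[arm] + obs_sd**2)
    post_mean[arm] += gain * (reward - post_mean[arm])
    post_var = prior_var
    post_var[arm] = (1 - gain) * prior_var[arm]

    true_means += rng.normal(0.0, innov_sd, n_arms)
```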
A Definition of Non-Stationary Bandits
Despite the subject of non-stationary bandit learning having attracted much recent attention, we have yet to identify a formal definition of non-stationarity that can consistently distinguish non-stationary bandits from stationary ones. Prior work has characterized non-stationary bandits as bandits for which the reward distribution changes over time. We demonstrate that this definition can ambiguously classify the same bandit as both stationary and non-stationary; this ambiguity arises in the existing definition's dependence on the latent sequence of reward distributions. Moreover, the definition has given rise to two widely used notions of regret: the dynamic regret and the weak regret. These notions are not indicative of qualitative agent performance in some bandits. Additionally, this definition of non-stationary bandits has led to the design of agents that explore excessively. We introduce a formal definition of non-stationary bandits that resolves these issues. Our new definition provides a unified approach, applicable seamlessly to both Bayesian and frequentist formulations of bandits. Furthermore, our definition ensures consistent classification of two bandits offering agents indistinguishable experiences, categorizing them as either both stationary or both non-stationary. This advancement provides a more robust framework for non-stationary bandit learning.
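The dynamic and weak regret mentioned in the abstract are standard notions: dynamic regret compares the agent to the per-round best arm, weak regret to the single best fixed arm in hindsight. A small sketch (the function names and toy environment are mine, not the paper's) shows how the two can disagree about the same agent, which is the kind of inconsistency the paper targets.

```python
import numpy as np

def dynamic_regret(mu, actions):
    """Regret vs. the per-round best arm: sum_t (max_a mu[t,a] - mu[t,actions[t]])."""
    t = np.arange(len(actions))
    return float(np.sum(mu.max(axis=1) - mu[t, actions]))

def weak_regret(mu, actions):
    """Regret vs. the single best fixed arm in hindsight."""
    best_fixed = mu.sum(axis=0).argmax()
    t = np.arange(len(actions))
    return float(np.sum(mu[t, best_fixed] - mu[t, actions]))

# Toy example: two arms whose mean rewards swap halfway through.
T = 100
mu = np.zeros((T, 2))
mu[:50, 0], mu[50:, 1] = 1.0, 1.0

always_arm0 = np.zeros(T, dtype=int)
print(dynamic_regret(mu, always_arm0))  # 50.0: misses arm 1 after the change
print(weak_regret(mu, always_arm0))     # 0.0: both fixed arms tie in hindsight
```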
Non-Stationary Bandit Learning via Predictive Sampling
Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We show that such failures are attributable to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling whose computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
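The abstract does not spell out the predictive sampling algorithm itself, but the setting it addresses can be sketched: a plain Thompson sampling baseline in an AR(1) Gaussian bandit, where an arm's persistence parameter controls how quickly information about it loses usefulness. Everything below (dynamics, per-arm persistence values, noise levels) is an assumption for illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# AR(1) Gaussian bandit: arm means revert toward 0 and are perturbed each
# round; a persistence gamma close to 0 means information decays quickly.
n_arms, T = 3, 500
gamma = np.array([0.99, 0.9, 0.5])   # per-arm persistence (assumed values)
innov_sd, obs_sd = 1.0, 1.0

theta = rng.normal(0, 1, n_arms)     # latent mean rewards
m = np.zeros(n_arms)                 # posterior means
v = np.ones(n_arms)                  # posterior variances

total = 0.0
for t in range(T):
    # Thompson sampling: sample a mean per arm, play the argmax.
    arm = int(rng.normal(m, np.sqrt(v)).argmax())
    reward = rng.normal(theta[arm], obs_sd)
    total += reward

    # Kalman measurement update for the played arm.
    k = v[arm] / (v[arm] + obs_sd**2)
    m[arm] += k * (reward - m[arm])
    v[arm] *= 1 - k

    # Posterior and environment both follow the AR(1) dynamics.
    m *= gamma
    v = gamma**2 * v + innov_sd**2
    theta = gamma * theta + rng.normal(0, innov_sd, n_arms)

print("average reward:", total / T)
```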
Thompson Sampling for Bayesian Bandits with Resets
Multi-armed bandit problems are challenging sequential decision problems that have been widely studied, as they constitute a mathematical framework that abstracts many different decision problems in fields such as machine learning, logistics, industrial optimization, and the management of clinical trials. In this paper we address a non-stationary environment with expected rewards that evolve dynamically, considering a particular type of drift, which we call resets, in which the arm qualities are re-initialized from time to time. We compare different arm selection strategies through simulations, focusing on a Bayesian method based on Thompson sampling (a simple, yet effective, technique for trading off between exploration and exploitation).
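As a rough illustration of this setting (not the paper's exact protocol or strategy comparison), here is a Beta-Bernoulli Thompson sampling loop in which the arm qualities are re-initialized at random reset times; discounting the posterior counts is one common way to keep the sampler responsive after an unobserved reset. Every parameter here is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(2)

n_arms, T, reset_prob = 5, 2000, 0.005    # assumed parameters
p = rng.random(n_arms)                    # Bernoulli arm qualities
alpha = np.ones(n_arms)                   # Beta posterior parameters
beta = np.ones(n_arms)
discount = 0.995                          # forget old evidence (one common fix)

for t in range(T):
    # Thompson sampling: draw a quality estimate per arm, play the best.
    arm = int(rng.beta(alpha, beta).argmax())
    reward = rng.binomial(1, p[arm])
    alpha[arm] += reward
    beta[arm] += 1 - reward

    # Discounting shrinks the posteriors back toward the uniform prior,
    # so the sampler can re-adapt after a reset.
    alpha = 1 + discount * (alpha - 1)
    beta = 1 + discount * (beta - 1)

    # Reset drift: occasionally all arm qualities are re-initialized.
    if rng.random() < reset_prob:
        p = rng.random(n_arms)
```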