Uncertainty and Exploration in a Restless Bandit Problem
Decision making in noisy and changing environments requires a fine balance between exploiting knowledge about good courses of action and exploring the environment in order to improve upon this knowledge. We present an experiment on a restless bandit task in which participants made repeated choices between options whose average rewards changed over time. Comparing a number of computational models of participants' behavior in this task, we find evidence that a substantial number of participants balanced exploration and exploitation by considering the probability that an option offers the maximum reward out of all the available options.
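For a flavor of what such a probability-of-maximum choice rule looks like, here is a minimal Python sketch: arm means drift as Gaussian random walks (the restless bandit), a Kalman filter tracks each arm's posterior, and the agent chooses in proportion to the estimated probability that each arm currently offers the maximum reward. All parameters and the Kalman tracking model are illustrative assumptions, not values or model details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_trials = 4, 200
innov_sd, obs_sd = 2.0, 4.0        # drift and observation noise (assumed values)

true_means = np.zeros(n_arms)      # arm means follow Gaussian random walks
post_mean = np.zeros(n_arms)       # Kalman posterior over each arm's mean
post_var = np.full(n_arms, 100.0)  # diffuse prior

for t in range(n_trials):
    # Estimate, by Monte Carlo, the probability that each arm currently
    # offers the maximum reward, then choose in proportion to it.
    draws = rng.normal(post_mean, np.sqrt(post_var), size=(1000, n_arms))
    p_max = np.bincount(draws.argmax(axis=1), minlength=n_arms) / 1000.0
    arm = rng.choice(n_arms, p=p_max)

    reward = rng.normal(true_means[arm], obs_sd)

    # Kalman update for the chosen arm; unchosen arms only accumulate
    # uncertainty, because the underlying means keep drifting.
    prior_var = post_var + innov_sd**2
    gain = prior_var[arm] / (prior_var[arm] + obs_sd**2)
    post_mean[arm] += gain * (reward - post_mean[arm])
    post_var = prior_var
    post_var[arm] = (1 - gain) * prior_var[arm]

    true_means += rng.normal(0.0, innov_sd, n_arms)
```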
A Definition of Non-Stationary Bandits
Despite the subject of non-stationary bandit learning having attracted much recent attention, we have yet to identify a formal definition of non-stationarity that can consistently distinguish non-stationary bandits from stationary ones. Prior work has characterized non-stationary bandits as bandits for which the reward distribution changes over time. We demonstrate that this definition can ambiguously classify the same bandit as both stationary and non-stationary; this ambiguity arises in the existing definition's dependence on the latent sequence of reward distributions. Moreover, the definition has given rise to two widely used notions of regret: the dynamic regret and the weak regret. These notions are not indicative of qualitative agent performance in some bandits. Additionally, this definition of non-stationary bandits has led to the design of agents that explore excessively. We introduce a formal definition of non-stationary bandits that resolves these issues. Our new definition provides a unified approach, applicable seamlessly to both Bayesian and frequentist formulations of bandits. Furthermore, our definition ensures consistent classification of two bandits offering agents indistinguishable experiences, categorizing them as either both stationary or both non-stationary. This advancement provides a more robust framework for non-stationary bandit learning.
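The dynamic and weak regret mentioned in the abstract are standard notions: dynamic regret compares the agent to the per-round best arm, weak regret to the single best fixed arm in hindsight. A small sketch (the function names and toy environment are mine, not the paper's) shows how the two can disagree about the same agent, which is the kind of inconsistency the paper targets.

```python
import numpy as np

def dynamic_regret(mu, actions):
    """Regret vs. the per-round best arm: sum_t (max_a mu[t,a] - mu[t,actions[t]])."""
    t = np.arange(len(actions))
    return float(np.sum(mu.max(axis=1) - mu[t, actions]))

def weak_regret(mu, actions):
    """Regret vs. the single best fixed arm in hindsight."""
    best_fixed = mu.sum(axis=0).argmax()
    t = np.arange(len(actions))
    return float(np.sum(mu[t, best_fixed] - mu[t, actions]))

# Toy example: two arms whose mean rewards swap halfway through.
T = 100
mu = np.zeros((T, 2))
mu[:50, 0], mu[50:, 1] = 1.0, 1.0

always_arm0 = np.zeros(T, dtype=int)
print(dynamic_regret(mu, always_arm0))  # 50.0: misses arm 1 after the change
print(weak_regret(mu, always_arm0))     # 0.0: both fixed arms tie in hindsight
```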
Non-Stationary Bandit Learning via Predictive Sampling
Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We show that such failures are attributable to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling whose computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
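The abstract does not spell out the predictive sampling algorithm itself, but the setting it addresses can be sketched: a plain Thompson sampling baseline in an AR(1) Gaussian bandit, where an arm's persistence parameter controls how quickly information about it loses usefulness. Everything below (dynamics, per-arm persistence values, noise levels) is an assumption for illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# AR(1) Gaussian bandit: arm means revert toward 0 and are perturbed each
# round; a persistence gamma close to 0 means information decays quickly.
n_arms, T = 3, 500
gamma = np.array([0.99, 0.9, 0.5])   # per-arm persistence (assumed values)
innov_sd, obs_sd = 1.0, 1.0

theta = rng.normal(0, 1, n_arms)     # latent mean rewards
m = np.zeros(n_arms)                 # posterior means
v = np.ones(n_arms)                  # posterior variances

total = 0.0
for t in range(T):
    # Thompson sampling: sample a mean per arm, play the argmax.
    arm = int(rng.normal(m, np.sqrt(v)).argmax())
    reward = rng.normal(theta[arm], obs_sd)
    total += reward

    # Kalman measurement update for the played arm.
    k = v[arm] / (v[arm] + obs_sd**2)
    m[arm] += k * (reward - m[arm])
    v[arm] *= 1 - k

    # Posterior and environment both follow the AR(1) dynamics.
    m *= gamma
    v = gamma**2 * v + innov_sd**2
    theta = gamma * theta + rng.normal(0, innov_sd, n_arms)

print("average reward:", total / T)
```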
Thompson Sampling for Bayesian Bandits with Resets
Multi-armed bandit problems are challenging sequential decision problems that have been widely studied, as they constitute a mathematical framework that abstracts many different decision problems in fields such as machine learning, logistics, industrial optimization, and the management of clinical trials. In this paper we address a non-stationary environment with expected rewards that evolve dynamically, considering a particular type of drift, which we call resets, in which the arm qualities are re-initialized from time to time. We compare different arm selection strategies through simulations, focusing on a Bayesian method based on Thompson sampling (a simple, yet effective, technique for trading off between exploration and exploitation).
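As a rough illustration of this setting (not the paper's exact protocol or strategy comparison), here is a Beta-Bernoulli Thompson sampling loop in which the arm qualities are re-initialized at random reset times; discounting the posterior counts is one common way to keep the sampler responsive after an unobserved reset. Every parameter here is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(2)

n_arms, T, reset_prob = 5, 2000, 0.005    # assumed parameters
p = rng.random(n_arms)                    # Bernoulli arm qualities
alpha = np.ones(n_arms)                   # Beta posterior parameters
beta = np.ones(n_arms)
discount = 0.995                          # forget old evidence (one common fix)

for t in range(T):
    # Thompson sampling: draw a quality estimate per arm, play the best.
    arm = int(rng.beta(alpha, beta).argmax())
    reward = rng.binomial(1, p[arm])
    alpha[arm] += reward
    beta[arm] += 1 - reward

    # Discounting shrinks the posteriors back toward the uniform prior,
    # so the sampler can re-adapt after a reset.
    alpha = 1 + discount * (alpha - 1)
    beta = 1 + discount * (beta - 1)

    # Reset drift: occasionally all arm qualities are re-initialized.
    if rng.random() < reset_prob:
        p = rng.random(n_arms)
```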