Search CORE

485 research outputs found

Functional Bandits

Author: Tran-Thanh Long
Yu Jia Yuan
Publication venue
Publication date: 10/05/2014
Field of study

We introduce the functional bandit problem, where the objective is to find an arm that optimises a known functional of the unknown arm-reward distributions. These problems arise in many settings such as maximum entropy methods in natural language processing, and risk-averse decision-making, but current best-arm identification techniques fail in these domains. We propose a new approach, that combines functional estimation and arm elimination, to tackle this problem. This method achieves provably efficient performance guarantees. In addition, we illustrate this method on a number of important functionals in risk management and information theory, and refine our generic theoretical results in those cases

arXiv.org e-Print Archive

CiteSeerX

Finding a most biased coin with fewest flips

Author: Chandrasekaran Karthekeyan
Karp Richard
Publication venue
Publication date: 07/09/2013
Field of study

We study the problem of learning a most biased coin among a set of coins by tossing the coins adaptively. The goal is to minimize the number of tosses until we identify a coin i* whose posterior probability of being most biased is at least 1-delta for a given delta. Under a particular probabilistic model, we give an optimal algorithm, i.e., an algorithm that minimizes the expected number of future tosses. The problem is closely related to finding the best arm in the multi-armed bandit problem using adaptive strategies. Our algorithm employs an optimal adaptive strategy -- a strategy that performs the best possible action at each step after observing the outcomes of all previous coin tosses. Consequently, our algorithm is also optimal for any starting history of outcomes. To our knowledge, this is the first algorithm that employs an optimal adaptive strategy under a Bayesian setting for this problem. Our proof of optimality employs tools from the field of Markov games

arXiv.org e-Print Archive

CiteSeerX

Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

Author: Ortner Ronald
Ryabko Daniil
Publication venue
Publication date: 01/01/2012
Field of study

We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Holder continuity of rewards and transition probabilities

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server

HAL-Rennes 1

An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits

Author: Principe Jose C.
Sledge Isaac J.
Publication venue: 'MDPI AG'
Publication date: 01/02/2018
Field of study

In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.Comment: Entrop

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

Satisficing in multi-armed bandit problems

Author: Leonard Naomi Ehrich
Reverdy Paul
Srivastava Vaibhav
Publication venue
Publication date: 19/12/2016
Field of study

Satisficing is a relaxation of maximizing and allows for less risky decision making in the face of uncertainty. We propose two sets of satisficing objectives for the multi-armed bandit problem, where the objective is to achieve reward-based decision-making performance above a given threshold. We show that these new problems are equivalent to various standard multi-armed bandit problems with maximizing objectives and use the equivalence to find bounds on performance. The different objectives can result in qualitatively different behavior; for example, agents explore their options continually in one case and only a finite number of times in another. For the case of Gaussian rewards we show an additional equivalence between the two sets of satisficing objectives that allows algorithms developed for one set to be applied to the other. We then develop variants of the Upper Credible Limit (UCL) algorithm that solve the problems with satisficing objectives and show that these modified UCL algorithms achieve efficient satisficing performance.Comment: To appear in IEEE Transactions on Automatic Contro

arXiv.org e-Print Archive

Princeton University Open Access Repository

Crossref