28 research outputs found

Computing a Classic Index for Finite-Horizon Bandits

Author: Bellman R.
Berry D. A.
Bradt R. N.
Dongarra J. J.
Ginebra J.
Gittins J. C.
José Niño-Mora
Niño-Mora J.
Robbins H.
Varaiya P. P.
Wang Y.-G.
Publication venue: 'Institute for Operations Research and the Management Sciences (INFORMS)'

Using Inverse Reinforcement Learning with Real Trajectories to Get More Trustworthy Pedestrian Simulation

Author: García-Fernández Ignacio
Lozano Ibáñez Miguel
Martinez Gil Francisco Antonio
Romero Pau
Sebastián Aguilar Rafael
Serra Dolors
Publication venue: 'MDPI AG'
Publication date: 01/01/2020

Reinforcement learning is one of the most promising machine learning techniques for obtaining intelligent behaviors for embodied agents in simulations. The output of the classic Temporal Difference family of Reinforcement Learning algorithms takes the form of a value function expressed as a numeric table or a function approximator. The learned behavior is then derived using a greedy policy with respect to this value function. Nevertheless, sometimes the learned policy does not meet expectations, and the task of authoring is difficult and unsafe because modifying one value or parameter in the learned value function has unpredictable consequences in the space of policies it represents. This rules out direct manipulation of the learned value function as a method to modify the derived behaviors. In this paper, we propose the use of Inverse Reinforcement Learning to incorporate real behavior traces in the learning process to shape the learned behaviors, thus increasing their trustworthiness (in terms of conformance to reality). To do so, we adapt the Inverse Reinforcement Learning framework to the navigation problem domain. Specifically, we use Soft Q-learning, an algorithm based on the maximum causal entropy principle, with MARL-Ped (a Reinforcement Learning-based pedestrian simulator) to include information from trajectories of real pedestrians in the process of learning how to navigate inside a virtual 3D space that represents the real environment. A comparison with the behaviors learned using a classic Reinforcement Learning algorithm (Sarsa(λ)) shows that the Inverse Reinforcement Learning behaviors fit the real trajectories significantly better.
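The contrast the abstract draws between a greedy policy and a maximum-entropy (soft) policy can be illustrated with a minimal sketch. It assumes a tabular list of action values for a single state and the standard Boltzmann form of the soft policy. The function names and the `temperature` parameter are hypothetical and are not taken from the paper or from MARL-Ped.

```python
import math


def greedy_action(q_values):
    """Index of the highest-valued action (the greedy policy)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def soft_policy(q_values, temperature=1.0):
    """Boltzmann action probabilities: pi(a) = exp(Q(a)/T - V), where
    V = log sum_a exp(Q(a)/T) is the soft state value.
    Assumed standard maximum-entropy form, not the paper's exact setup."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the maximum for numerical stability
    soft_value = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [math.exp(x - soft_value) for x in scaled]


q = [1.0, 2.0, 0.5]            # illustrative action values for one state
best = greedy_action(q)        # picks a single action deterministically
probs = soft_policy(q)         # spreads probability over all actions
```

A lower `temperature` concentrates the soft policy on the greedy action, while a higher one moves it toward a uniform distribution.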

Repositori d'Objectes Digitals per a l'Ensenyament la Recerca i la Cultura

Activity report. 2014

Publication date: 01/01/2015

Universidad Carlos III de Madrid e-Archivo

BelMan: Bayesian Bandits on the Belief–Reward Manifold

Author: Basu Debabrota
Bressan Stéphane
Senellart Pierre
Publication date: 10/10/2018

We propose a generic, Bayesian, information geometric approach to the exploration–exploitation trade-off in multi-armed bandit problems. Our approach, BelMan, uniformly supports pure exploration, exploration–exploitation, and two-phase bandit problems. The knowledge on bandit arms and their reward distributions is summarised by the barycentre of the joint distributions of beliefs and rewards of the arms, the pseudobelief-reward, within the beliefs-rewards manifold. BelMan alternates information projection and reverse information projection, i.e., projection of the pseudobelief-reward onto beliefs-rewards to choose the arm to play, and projection of the resulting beliefs-rewards onto the pseudobelief-reward. It introduces a mechanism that infuses an exploitative bias by means of a focal distribution, i.e., a reward distribution that gradually concentrates on higher rewards. Comparative performance evaluation with state-of-the-art algorithms shows that BelMan is not only competitive but can also outperform other approaches in specific setups, for instance involving many arms and continuous rewards. Comment: 36 pages, 14 figures, accepted in ECML PKDD 201

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges.

Author: Bowden Jack
Villar Sofía S
Wason James
Publication venue: Stat Sci
Publication date: 01/01/2015

Multi-armed bandit problems (MABPs) are a special type of optimal control problem well suited to model resource allocation under uncertainty in a wide variety of contexts. Since the first publication of the optimal solution of the classic MABP by a dynamic index rule, the bandit literature quickly diversified and emerged as an active research topic. Across this literature, the use of bandit models to optimally design clinical trials became a typical motivating application, yet little of the resulting theory has ever been used in the actual design and analysis of clinical trials. To this end, we review two MABP decision-theoretic approaches to the optimal allocation of treatments in a clinical trial: the infinite-horizon Bayesian Bernoulli MABP and the finite-horizon variant. These models possess distinct theoretical properties and lead to separate allocation rules in a clinical trial design context. We evaluate their performance compared to other allocation rules, including fixed randomization. Our results indicate that bandit approaches offer significant advantages, in terms of assigning more patients to better treatments, and severe limitations, in terms of their resulting statistical power. We propose a novel bandit-based patient allocation rule that overcomes the issue of low power, thus removing a potential barrier for their use in practice.
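The finite-horizon Bayesian Bernoulli variant mentioned in the abstract can be sketched for two arms by backward induction over posterior counts. This is a minimal illustration under assumed Beta priors shared by both arms. It is not the index rule or the allocation rule proposed in the paper, and the function name and parameters are hypothetical.

```python
from functools import lru_cache


def bayes_optimal_value(horizon, prior_a=1, prior_b=1):
    """Expected number of successes over `horizon` patients when two
    Bernoulli arms with Beta(prior_a, prior_b) priors are allocated
    Bayes-optimally (dynamic programming over success/failure counts)."""

    @lru_cache(maxsize=None)
    def value(s1, f1, s2, f2):
        if s1 + f1 + s2 + f2 == horizon:
            return 0.0  # no patients left to allocate
        # Posterior mean success probability of each arm.
        p1 = (prior_a + s1) / (prior_a + prior_b + s1 + f1)
        p2 = (prior_a + s2) / (prior_a + prior_b + s2 + f2)
        # Expected reward now plus expected future value, per arm.
        v1 = p1 * (1 + value(s1 + 1, f1, s2, f2)) + (1 - p1) * value(s1, f1 + 1, s2, f2)
        v2 = p2 * (1 + value(s1, f1, s2 + 1, f2)) + (1 - p2) * value(s1, f1, s2, f2 + 1)
        return max(v1, v2)

    return value(0, 0, 0, 0)
```

With uniform priors, `bayes_optimal_value(2)` evaluates to 13/12 (about 1.083) expected successes, compared with 1.0 under fixed equal randomization. The sketch only measures expected successes and says nothing about statistical power, which is the limitation the abstract highlights.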

arXiv.org e-Print Archive

Crossref

PubMed Central

Apollo (Cambridge)

Explore Bristol Research