8,460 research outputs found

Risk-sensitive Inverse Reinforcement Learning via Semi- and Non-Parametric Methods

Authors: Lacotte Jonathan; Majumdar Anirudha; Pavone Marco; Singh Sumeet
Publication date: 01/01/2018

The literature on Inverse Reinforcement Learning (IRL) typically assumes that humans take actions in order to minimize the expected value of a cost function, i.e., that humans are risk neutral. Yet, in practice, humans are often far from being risk neutral. To fill this gap, the objective of this paper is to devise a framework for risk-sensitive IRL in order to explicitly account for a human's risk sensitivity. To this end, we propose a flexible class of models based on coherent risk measures, which allow us to capture an entire spectrum of risk preferences from risk-neutral to worst-case. We propose efficient non-parametric algorithms based on linear programming and semi-parametric algorithms based on maximum likelihood for inferring a human's underlying risk measure and cost function for a rich class of static and dynamic decision-making settings. The resulting approach is demonstrated on a simulated driving game with ten human participants. Our method is able to infer and mimic a wide range of qualitatively different driving styles from highly risk-averse to risk-neutral in a data-efficient manner. Moreover, comparisons of the Risk-Sensitive (RS) IRL approach with a risk-neutral model show that the RS-IRL framework more accurately captures observed participant behavior both qualitatively and quantitatively, especially in scenarios where catastrophic outcomes such as collisions can occur.

Comment: Submitted to International Journal of Robotics Research; Revision 1: (i) Clarified minor technical points; (ii) Revised proof for Theorem 3 to hold under weaker assumptions; (iii) Added additional figures and expanded discussions to improve readabilit

arXiv.org e-Print Archive

Princeton University Open Access Repository
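
The abstract above describes coherent risk measures that span risk-neutral to worst-case preferences. As a rough illustration only, the sketch below uses Conditional Value-at-Risk (CVaR), one well-known coherent risk measure; the abstract does not say which measures the paper actually uses, and the function name `cvar` and the example numbers are assumptions made here.

```python
def cvar(costs, probs, alpha):
    """CVaR at level alpha in (0, 1] for a discrete cost distribution.

    Returns the mean cost over the worst alpha-fraction of probability mass.
    alpha = 1 gives the plain expectation (risk neutral); as alpha shrinks
    toward 0 the value approaches the largest possible cost (worst case).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # Visit outcomes from the highest cost to the lowest.
    outcomes = sorted(zip(costs, probs), key=lambda cp: cp[0], reverse=True)
    remaining = alpha  # probability mass still to be collected
    total = 0.0
    for cost, p in outcomes:
        take = min(p, remaining)
        total += take * cost
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha


# Hypothetical driving-game costs: no incident, near miss, collision.
costs = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]
risk_neutral = cvar(costs, probs, 1.0)  # equals the expected cost
risk_averse = cvar(costs, probs, 0.1)   # dominated by the collision cost
```

With these assumed numbers, `alpha = 1.0` returns the mean cost (about 1.4) and `alpha = 0.1` returns the collision cost (about 10.0), so a single parameter moves the evaluation from risk-neutral to worst-case.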

Cover Tree Bayesian Reinforcement Learning

Authors: Blekas Konstantinos; Dimitrakakis Christos; Tziortziotis Nikolaos
Publication date: 08/12/2013

This paper proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high-dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Chalmers Research

Chalmers Publication Library
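
The abstract above combines a Bayesian model with Thompson sampling. The sketch below shows Thompson sampling in its simplest textbook form, a Beta-Bernoulli bandit, which is a deliberate simplification and not the paper's cover-tree model; the function name `thompson_bernoulli` and the arm means are assumptions for illustration.

```python
import random


def thompson_bernoulli(true_means, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [1] * n_arms  # Beta(1, 1) uniform prior on each arm
    failures = [1] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible mean per arm from its posterior.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Closed-form posterior update for the chosen arm.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_bernoulli([0.1, 0.9], n_rounds=2000)
```

Exploration here comes from posterior sampling alone: arms with uncertain posteriors are occasionally sampled high and get tried, while well-estimated poor arms are pulled less and less often.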

A tutorial on recursive models for analyzing and predicting path choice behavior

Authors: Frejinger Emma; Zimmermann Maëlle
Publication date: 19/03/2020

The problem at the heart of this tutorial consists in modeling the path choice behavior of network users. This problem has been extensively studied in transportation science, where it is known as the route choice problem. In this literature, individuals' path choices are typically predicted using discrete choice models. This article is a tutorial on a specific category of discrete choice models called recursive, and it makes three main contributions: First, for the purpose of assisting future research on route choice, we provide a comprehensive background on the problem, linking it to different fields including inverse optimization and inverse reinforcement learning. Second, we formally introduce the problem and the recursive modeling idea along with an overview of existing models, their properties and applications. Third, we extensively analyze illustrative examples from different angles so that a novice reader can gain intuition on the problem and the advantages provided by recursive models in comparison to path-based ones.

arXiv.org e-Print Archive
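
The recursive modeling idea in the abstract above can be sketched with a recursive logit model on a tiny acyclic network, under the assumption of unit scale and deterministic arc utilities. The network `ARCS`, its utilities, and the function names are hypothetical; the tutorial's own notation and examples may differ.

```python
import math
from functools import lru_cache

# Hypothetical acyclic network: ARCS[k][a] is the utility (negative cost)
# of moving from node k to node a. "d" is the destination.
ARCS = {
    "o": {"a": -1.0, "b": -2.0},
    "a": {"b": -0.5, "d": -2.0},
    "b": {"d": -1.0},
    "d": {},
}
DEST = "d"


@lru_cache(maxsize=None)
def value(node):
    """Expected maximum utility to the destination (logsum recursion)."""
    if node == DEST:
        return 0.0
    return math.log(sum(math.exp(u + value(nxt)) for nxt, u in ARCS[node].items()))


def choice_probs(node):
    """Logit probability of each outgoing arc at the given node."""
    v = value(node)
    return {nxt: math.exp(u + value(nxt) - v) for nxt, u in ARCS[node].items()}


def path_prob(path):
    """Probability of a path as the product of its arc choice probabilities."""
    p = 1.0
    for k, nxt in zip(path, path[1:]):
        p *= choice_probs(k)[nxt]
    return p


paths = [["o", "a", "d"], ["o", "a", "b", "d"], ["o", "b", "d"]]
total = sum(path_prob(p) for p in paths)
```

The model never enumerates paths when computing `value`, yet on this small network the resulting path probabilities coincide with a path-based logit over all three paths, which is the kind of comparison between recursive and path-based models the abstract refers to.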

Nonparametric learning rules from bandit experiments: the eyes have it!

Authors: Matt Shum; Yingyao Hu; Yutaka Kayaba

We estimate nonparametric learning rules using data from dynamic two-armed bandit (probabilistic reversal learning) experiments, supplemented with auxiliary eye-movement measures of subjects' beliefs. We apply recent econometric developments in the estimation of dynamic models. The direct estimation of learning rules differs from the usual modus operandi of the experimental literature. The estimated choice probabilities and learning rules from our nonparametric models have some distinctive features; notably that subjects tend to update in a non-smooth manner following positive 'exploitative' choices (those made in accordance with current beliefs). Simulation results show how the estimated nonparametric learning rules fit aspects of subjects' observed choice sequences better than alternative parameterized learning rules from Bayesian and reinforcement learning models.

Research Papers in Economics

An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits

Authors: Principe Jose C.; Sledge Isaac J.
Publication venue: MDPI AG
Publication date: 01/02/2018

In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.

Comment: Entrop

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals
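
The last abstract above describes a single parameter, annealed over the course of the search, that controls how an agent trades exploration against reward. The sketch below is a generic stand-in, not the paper's value of information criterion: it uses Boltzmann (softmax) action selection with a temperature, and the schedule in `cooled_temperature` is a hypothetical choice made only for illustration.

```python
import math


def softmax_policy(values, temperature):
    """Boltzmann action probabilities; lower temperature means greedier play."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    m = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    z = sum(weights)
    return [w / z for w in weights]


def cooled_temperature(t0, episode):
    """Hypothetical annealing schedule: starts at t0 and decays slowly."""
    return t0 / math.log(episode + math.e)


estimates = [1.0, 0.5, 0.2]                 # assumed reward estimates per arm
early = softmax_policy(estimates, 10.0)     # high temperature: near uniform
late = softmax_policy(estimates, 0.05)      # low temperature: near greedy
```

At a high temperature the three arms are chosen almost equally often, while at a low temperature nearly all probability sits on the arm with the highest estimate, so lowering the parameter over episodes shifts behaviour from exploring to exploiting.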