Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach
Reinforcement learning (RL) agents have traditionally been tasked with
maximizing the value function of a Markov decision process (MDP), either in
continuous settings, with fixed discount factor $\gamma < 1$, or in episodic
settings, with $\gamma = 1$. While this has proven effective for specific tasks
with well-defined objectives (e.g., games), it has never been established that
fixed discounting is suitable for general purpose use (e.g., as a model of
human preferences). This paper characterizes rationality in sequential decision
making using a set of seven axioms and arrives at a form of discounting that
generalizes traditional fixed discounting. In particular, our framework admits
a state-action dependent "discount" factor that is not constrained to be less
than 1, so long as there is eventual long run discounting. Although this
broadens the range of possible preference structures in continuous settings, we
show that there exists a unique "optimizing MDP" with fixed $\gamma < 1$ whose
optimal value function matches the true utility of the optimal policy, and we
quantify the difference between value and utility for suboptimal policies. Our
work can be seen as providing a normative justification for (a slight
generalization of) Martha White's RL task formalism (2017) and other recent
departures from traditional RL, and is relevant to task specification in
RL, inverse RL and preference-based RL.
Comment: 8 pages + 1 page supplement. In proceedings of AAAI 2019. Slides,
poster and bibtex available at
https://silviupitis.com/#rethinking-the-discount-factor-in-reinforcement-learning-a-decision-theoretic-approac
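
For concreteness, the discounting generalization can be written out as follows (a sketch in assumed notation, not quoted from the paper). With a fixed discount, $V^\pi(s) = \mathbb{E}\big[\sum_{t \ge 0} \gamma^t r_t\big]$; with a state-action dependent factor $\gamma(s, a)$, the constant rate is replaced by an accumulated product,

$$V^\pi(s) \;=\; \mathbb{E}\Big[\sum_{t \ge 0} \Big(\prod_{k=0}^{t-1} \gamma(s_k, a_k)\Big)\, r_t\Big],$$

where individual factors may exceed 1 so long as the accumulated product eventually decays, which is what keeps the sum well defined.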
Trajectory-Based Off-Policy Deep Reinforcement Learning
Policy gradient methods are powerful reinforcement learning algorithms and
have been demonstrated to solve many complex tasks. However, these methods are
also data-inefficient, afflicted with high variance gradient estimates, and
frequently get stuck in local optima. This work addresses these weaknesses by
combining recent improvements in the reuse of off-policy data and exploration
in parameter space with deterministic behavioral policies. The resulting
objective is amenable to standard neural network optimization strategies like
stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo.
Incorporation of previous rollouts via importance sampling greatly improves
data-efficiency, whilst stochastic optimization schemes facilitate the escape
from local optima. We evaluate the proposed approach on a series of continuous
control benchmark tasks. The results show that the proposed algorithm is able
to successfully and reliably learn solutions using fewer system interactions
than standard policy gradient methods.
Comment: Includes appendix. Accepted for ICML 2019.
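
The core combination the abstract describes, deterministic policies with parameter-space exploration plus importance-sampled reuse of old rollouts, can be sketched as follows. This is a minimal illustration under assumed names (log_gaussian, is_weighted_objective) and a diagonal-Gaussian parameter distribution, not the paper's actual implementation:

# Hedged sketch: trajectory-level importance sampling with
# parameter-space exploration (names and shapes are illustrative).
import numpy as np

def log_gaussian(theta, mean, std):
    """Log-density of a diagonal Gaussian over policy parameters."""
    return -0.5 * np.sum(((theta - mean) / std) ** 2
                         + 2 * np.log(std) + np.log(2 * np.pi))

def is_weighted_objective(rollouts, mean, std):
    """Importance-weighted return estimate.

    rollouts: list of (theta, return, old_mean, old_std) tuples, where
    theta was drawn from the behavioral distribution (old_mean, old_std)
    and used as the parameters of a deterministic policy.
    """
    weighted, total_w = 0.0, 0.0
    for theta, ret, old_mean, old_std in rollouts:
        # Trajectory-level weight: density ratio of the sampled
        # parameters under the current vs. behavioral distribution.
        logw = (log_gaussian(theta, mean, std)
                - log_gaussian(theta, old_mean, old_std))
        w = np.exp(logw)
        weighted += w * ret
        total_w += w
    # Self-normalized estimator trades a little bias for lower variance.
    return weighted / max(total_w, 1e-8)

The self-normalized weights keep the estimate stable when the current distribution drifts away from the behavioral one; the stochastic optimizers named in the abstract (SGD, stochastic gradient HMC) would then ascend this objective over (mean, std).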
Sequential Design for Ranking Response Surfaces
We propose and analyze sequential design methods for the problem of ranking
several response surfaces. Namely, given $L$ response surfaces over a
continuous input space $\mathcal{X}$, the aim is to efficiently find the index of
the minimal response across the entire $\mathcal{X}$. The response surfaces are not
known and have to be noisily sampled one-at-a-time. This setting is motivated
by stochastic control applications and requires joint experimental design both
in space and response-index dimensions. To generate sequential design
heuristics we investigate stepwise uncertainty reduction approaches, as well as
sampling based on posterior classification complexity. We also make connections
between our continuous-input formulation and the discrete framework of pure
regret in multi-armed bandits. To model the response surfaces we utilize
kriging surrogates. Several numerical examples using both synthetic data and an
epidemics control problem are provided to illustrate our approach and the
efficacy of the respective adaptive designs.
Comment: 26 pages, 7 figures (updated several sections and figures)
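
To make the sequential design loop concrete, here is a minimal sketch. Assumptions are labeled: a squared-exponential kernel with unit prior variance, a simple uncertainty-gap acquisition rule standing in for the paper's stepwise uncertainty reduction and posterior classification criteria, and hypothetical helper names throughout:

# Hedged sketch: sequential design for ranking response surfaces with
# GP (kriging) surrogates; the acquisition rule is illustrative only.
import numpy as np

def rbf(A, B, ls=0.2):
    """Squared-exponential kernel matrix between input sets A and B."""
    d = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1) / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """GP posterior mean and standard deviation at test inputs Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.clip(1.0 - np.sum(Ks * sol, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def sequential_ranking(samplers, Xs, n_steps=100,
                       rng=np.random.default_rng(0)):
    """samplers[i](x) returns one noisy observation of surface i at x."""
    data = [([], []) for _ in samplers]
    # Initialize each surrogate with a few random samples.
    for i, f in enumerate(samplers):
        for _ in range(3):
            x = Xs[rng.integers(len(Xs))]
            data[i][0].append(x); data[i][1].append(f(x))
    for _ in range(n_steps):
        post = [gp_posterior(np.array(X), np.array(y), Xs)
                for X, y in data]
        mus = np.stack([m for m, _ in post])   # (n_surfaces, n_test)
        sds = np.stack([s for _, s in post])
        # Joint design over input and response index: find where the
        # two lowest posterior means are hardest to tell apart.
        order = np.argsort(mus, axis=0)
        i1, i2 = order[0], order[1]
        cols = np.arange(Xs.shape[0])
        gap = (mus[i2, cols] - mus[i1, cols]) / np.sqrt(
            sds[i1, cols] ** 2 + sds[i2, cols] ** 2)
        j = int(np.argmin(gap))   # most ambiguous input location
        i = i1[j]                 # sample the current leader there
        x = Xs[j]
        data[i][0].append(x); data[i][1].append(samplers[i](x))
    mus = np.stack([gp_posterior(np.array(X), np.array(y), Xs)[0]
                    for X, y in data])
    return np.argmin(mus, axis=0)  # estimated minimal index over Xs

The sketch mirrors the joint experimental design the abstract emphasizes: every step chooses both an input location and a response index, and the final classification (which surface is minimal where) is read off the posterior means.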