
1,323 research outputs found

Examining average and discounted reward optimality criteria in reinforcement learning

Authors: Dewanto Vektor; Gallagher Marcus
Publication venue
Publication date: 03/07/2021
Field of study

In reinforcement learning (RL), the goal is to obtain an optimal policy, for which the optimality criterion is fundamentally important. Two major optimality criteria are average and discounted rewards, where the latter is typically considered an approximation to the former. While the discounted reward is more popular, it is problematic to apply in environments that have no natural notion of discounting. This motivates us to revisit a) the progression of optimality criteria in dynamic programming, b) the justification for and complications of an artificial discount factor, and c) the benefits of directly maximizing the average reward. Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL. We emphasize that average-reward RL methods possess the ingredient and mechanism for developing the general discounting-free optimality criterion (Veinott, 1969) in RL.

Comment: 14 pages, 3 figures, 10-page main content
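The sense in which the discounted reward approximates the average reward can be illustrated numerically. The sketch below is not taken from the paper: it assumes a made-up two-state Markov chain under a fixed policy (the transition matrix `P` and rewards `r` are hypothetical) and relies on the standard result that, for a finite ergodic chain, (1 - γ) times the discounted value of every state approaches the average reward (gain) as γ tends to 1.

```python
def stationary_distribution(P, iters=1000):
    """Power iteration for the stationary distribution of an ergodic chain."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def discounted_values(P, r, gamma, tol=1e-9, max_iters=1_000_000):
    """Policy evaluation: iterate v <- r + gamma * P v until the update is tiny."""
    n = len(P)
    v = [0.0] * n
    for _ in range(max_iters):
        new_v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Hypothetical chain induced by some fixed policy (illustrative numbers only).
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 0.0]

mu = stationary_distribution(P)
gain = sum(m * x for m, x in zip(mu, r))  # average reward of the policy
print("average reward:", gain)

for gamma in (0.9, 0.99, 0.999):
    v = discounted_values(P, r, gamma)
    scaled = [(1.0 - gamma) * x for x in v]
    print(gamma, scaled)  # each entry drifts toward the average reward
```

For small γ the scaled values still depend noticeably on the start state; that dependence shrinks as γ approaches 1, while the unscaled values grow without bound, which is one practical complication of using a discount factor close to 1.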

arXiv.org e-Print Archive

Learning policies for Markov decision processes from data

Authors: Hanawal Manjesh K.; Liu Hao; Paschalidis Ioannis Ch.; Zhu Henghui
Publication venue
Publication date: 01/01/2017
Field of study

We consider the problem of learning a policy for a Markov decision process consistent with data captured on the state-action pairs followed by the policy. We assume that the policy belongs to a class of parameterized policies which are defined using features associated with the state-action pairs. The features are known a priori; however, only an unknown subset of them could be relevant. The policy parameters that correspond to an observed target policy are recovered using ℓ1-regularized logistic regression that best fits the observed state-action samples. We establish bounds on the difference between the average reward of the estimated and the original policy (regret) in terms of the generalization error and the ergodic coefficient of the underlying Markov chain. To that end, we combine sample complexity theory and sensitivity analysis of the stationary distribution of Markov chains. Our analysis suggests that to achieve regret within order O(√ε), it suffices to use training sample size on the order of Ω(log n · poly(1/ε)), where n is the number of features. We demonstrate the effectiveness of our method on a synthetic robot navigation example.
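As a rough illustration of the recovery step described in the abstract, the sketch below fits an ℓ1-regularized logistic regression to synthetic state-action samples. It is not the authors' implementation: it assumes a binary action, uses plain proximal gradient descent (ISTA) as the solver, and every name, parameter value, and data point (`fit_l1_logistic`, `true_theta`, `lam`, the uniform features) is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def fit_l1_logistic(X, a, lam=0.02, step=1.0, iters=300):
    """ISTA on the mean logistic loss plus lam * ||theta||_1."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for x, label in zip(X, a):
            err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - label
            for j in range(n):
                grad[j] += err * x[j] / m
        # Gradient step on the smooth loss, then the l1 proximal step.
        theta = [soft_threshold(t - step * g, step * lam)
                 for t, g in zip(theta, grad)]
    return theta


# Synthetic data (assumption): 5 features, only the first two are relevant.
random.seed(0)
true_theta = [2.0, -2.0, 0.0, 0.0, 0.0]
X = [[random.uniform(-1.0, 1.0) for _ in true_theta] for _ in range(1000)]
a = [1 if random.random() < sigmoid(sum(t * xi for t, xi in zip(true_theta, x)))
     else 0
     for x in X]  # observed actions sampled from the target policy

theta_hat = fit_l1_logistic(X, a)
print([round(t, 3) for t in theta_hat])
```

The ℓ1 penalty shrinks all coefficients and tends to drive those of irrelevant features to zero, which is why it suits the setting where only an unknown subset of the features matters; the price is some shrinkage bias on the relevant coefficients.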

Boston University Institutional Repository (OpenBU)

arXiv.org e-Print Archive
