
    A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints

    Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on various other cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm which leverages the linear programming formulation of the finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $\epsilon$-optimal policy, i.e., a policy whose objective value is within $\epsilon$ of the optimal value and which satisfies the constraints within $\epsilon$-tolerance, with probability at least $1-\delta$. The number of episodes needed is shown to be of the order $\tilde{\mathcal{O}}\big(\frac{|S||A|C^{2}H^{2}}{\epsilon^{2}}\log\frac{1}{\delta}\big)$, where $C$ is an upper bound on the number of possible successor states of a state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed has a linear dependence on the state and action space sizes $|S|$ and $|A|$, respectively, and a quadratic dependence on the time horizon $H$.
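As a rough illustration of the planning sub-problem the abstract refers to, the sketch below writes a finite-horizon CMDP as a linear program over occupancy measures $q_h(s,a)$ and solves it with `scipy.optimize.linprog`. It assumes a known transition kernel and a single constraint, and all function and variable names are illustrative; the paper's algorithm instead re-solves this kind of LP with optimistically estimated models, which is not shown here.

```python
# Minimal sketch: finite-horizon CMDP as an LP over occupancy measures q_h(s, a),
# assuming the transition kernel P is known and there is a single constraint.
import numpy as np
from scipy.optimize import linprog

def solve_finite_horizon_cmdp(P, cost, constraint_cost, budget, mu, H):
    """P: (S, A, S) transitions; cost, constraint_cost: (S, A); mu: (S,) initial dist."""
    S, A, _ = P.shape
    n = H * S * A                                   # one variable q_h(s, a) per (h, s, a)
    idx = lambda h, s, a: (h * S + s) * A + a

    # Objective: expected cumulative cost, sum_h sum_{s,a} q_h(s,a) cost(s,a).
    c = np.array([cost[s, a] for h in range(H) for s in range(S) for a in range(A)])

    A_eq, b_eq = [], []
    for s in range(S):                              # sum_a q_0(s, a) = mu(s)
        row = np.zeros(n)
        row[[idx(0, s, a) for a in range(A)]] = 1.0
        A_eq.append(row); b_eq.append(mu[s])
    for h in range(1, H):                           # flow: sum_a q_h(s', a) = sum_{s,a} P(s'|s,a) q_{h-1}(s, a)
        for s2 in range(S):
            row = np.zeros(n)
            row[[idx(h, s2, a) for a in range(A)]] = 1.0
            for s in range(S):
                for a in range(A):
                    row[idx(h - 1, s, a)] -= P[s, a, s2]
            A_eq.append(row); b_eq.append(0.0)

    # Single constraint: expected cumulative constraint cost <= budget.
    A_ub = np.array([[constraint_cost[s, a] for h in range(H)
                      for s in range(S) for a in range(A)]])
    b_ub = np.array([budget])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    q = res.x.reshape(H, S, A)
    # Recover a (stochastic) time-dependent policy: pi_h(a|s) proportional to q_h(s, a).
    pi = q / np.clip(q.sum(axis=2, keepdims=True), 1e-12, None)
    return pi, res.fun
```

Re-solving an LP of this form each episode, with the true model replaced by optimistically adjusted estimates, is the repeated optimistic planning the abstract describes.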

    Reinforcement Learning with Trajectory Feedback

    The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. However, in practice, such frequent feedback is often unavailable. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as \emph{trajectory feedback}. Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For the case where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a tractable algorithm.
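The key observation behind the least-squares step is that the trajectory score is linear in the unknown per-state-action rewards. Below is a minimal sketch of that idea in a tabular setting: per-episode visit counts form the design matrix and a ridge-regularised solve recovers reward estimates. The function name, the regulariser, and the tabular featurisation are assumptions made for illustration, not the paper's exact estimator.

```python
# Minimal sketch: estimating per-state-action rewards from trajectory-level feedback.
# Each episode reveals only the total reward, which is linear in the unknown rewards,
# so regularised least squares on visit counts yields an estimate.
import numpy as np

def estimate_rewards(trajectories, total_rewards, S, A, reg=1.0):
    """trajectories: list of [(s, a), ...]; total_rewards: observed per-episode score sums."""
    d = S * A
    X = np.zeros((len(trajectories), d))        # row i = visit counts of episode i
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            X[i, s * A + a] += 1.0
    y = np.asarray(total_rewards, dtype=float)
    # Ridge regression: r_hat = (X^T X + reg * I)^{-1} X^T y
    r_hat = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return r_hat.reshape(S, A)                  # estimated reward per (s, a)
```

Estimates of this form can then be handed to a standard planner that assumes known rewards, which is the high-level structure the abstract describes for extending RL algorithms to trajectory feedback.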