A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints
Constrained Markov Decision Processes (CMDPs) formalize sequential
decision-making problems whose objective is to minimize a cost function while
satisfying constraints on various cost functions. In this paper, we consider
the setting of episodic fixed-horizon CMDPs. We propose an online algorithm
which leverages the linear programming formulation of finite-horizon CMDP for
repeated optimistic planning to provide a probably approximately correct (PAC)
guarantee on the number of episodes needed to ensure an $\epsilon$-optimal
policy, i.e., with resulting objective value within $\epsilon$ of the optimal
value and satisfying the constraints within $\epsilon$-tolerance, with
probability at least $1-\delta$. The number of episodes needed is shown to be
of the order
$\tilde{\mathcal{O}}\big(\frac{|S||A|C^{2}H^{2}}{\epsilon^{2}}\log\frac{1}{\delta}\big)$,
where $C$ is the upper bound on the number of possible successor states for a
state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed
has a linear dependence on the state and action space sizes $|S|$ and $|A|$,
respectively, and quadratic dependence on the time horizon $H$.
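
The linear programming formulation mentioned in this abstract rests on the fact that, in a finite-horizon CMDP, both the objective and the constraint values are linear in the occupancy measure of a policy. The following is a minimal sketch of that idea only, not the paper's algorithm: it assumes a small tabular CMDP with time-homogeneous transitions and stationary costs, and all names (`occupancy_measure`, `expected_total`, `objective_cost`, `constraint_cost`, the toy numbers) are hypothetical.

```python
def occupancy_measure(P, policy, mu0, horizon):
    """Forward recursion: q[h][s][a] = Pr(state s and action a at step h).

    P[s][a][s2]     : transition probability (assumed time-homogeneous)
    policy[h][s][a] : probability of action a in state s at step h
    mu0[s]          : initial state distribution
    """
    n_states = len(mu0)
    n_actions = len(policy[0][0])
    mu = list(mu0)
    q = []
    for h in range(horizon):
        q_h = [[mu[s] * policy[h][s][a] for a in range(n_actions)]
               for s in range(n_states)]
        q.append(q_h)
        # Push the state-action mass through the transition kernel.
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                for s2 in range(n_states):
                    nxt[s2] += q_h[s][a] * P[s][a][s2]
        mu = nxt
    return q


def expected_total(q, cost):
    """Linear functional of q: sum over h, s, a of q[h][s][a] * cost[s][a]."""
    return sum(q_h[s][a] * cost[s][a]
               for q_h in q
               for s in range(len(cost))
               for a in range(len(cost[0])))


# Toy CMDP (hypothetical): 2 states, 2 actions, horizon 2.
# Action 0 always leads to state 0, action 1 always leads to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
uniform = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(2)]
q = occupancy_measure(P, uniform, mu0=[1.0, 0.0], horizon=2)

objective_cost = [[1.0, 0.0], [1.0, 0.0]]   # taking action 0 costs 1
constraint_cost = [[0.0, 0.0], [1.0, 1.0]]  # being in state 1 costs 1

objective_value = expected_total(q, objective_cost)
constraint_value = expected_total(q, constraint_cost)
```

Because both values are linear in `q`, a planner can treat the entries of `q` as LP variables subject to flow-conservation constraints; the sketch above only evaluates a fixed policy and does not solve that LP.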
Reinforcement Learning with Trajectory Feedback
The standard feedback model of reinforcement learning requires revealing the
reward of every visited state-action pair. However, in practice, it is often
the case that such frequent feedback is not available. In this work, we take a
first step towards relaxing this assumption and require a weaker form of
feedback, which we refer to as \emph{trajectory feedback}. Instead of observing
the reward obtained after every action, we assume we only receive a score that
represents the quality of the whole trajectory observed by the agent, namely,
the sum of all rewards obtained over this trajectory. We extend reinforcement
learning algorithms to this setting, based on least-squares estimation of the
unknown reward, for both the known and unknown transition model cases, and
study the performance of these algorithms by analyzing their regret. For cases
where the transition model is unknown, we offer a hybrid optimistic-Thompson
Sampling approach that results in a tractable algorithm.
Comment: AAAI202
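
The least-squares idea in this abstract can be illustrated as follows: if each trajectory is summarized by how often it visits each state-action pair, the trajectory score is a linear function of the unknown per-pair rewards, so the rewards can be estimated by regression. This is a minimal sketch under simplifying assumptions (noiseless scores, a small ridge term for numerical stability, state-action pairs encoded as integer indices); the function names and the toy data are hypothetical and are not taken from the paper.

```python
def visit_counts(trajectory, n_pairs):
    """Count how often each state-action index appears in one trajectory."""
    counts = [0.0] * n_pairs
    for idx in trajectory:
        counts[idx] += 1.0
    return counts


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def estimate_rewards(trajectories, scores, n_pairs, ridge=1e-6):
    """Ridge-regularized least squares: (X^T X + ridge * I) r = X^T y."""
    X = [visit_counts(t, n_pairs) for t in trajectories]
    gram = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0)
             for j in range(n_pairs)]
            for i in range(n_pairs)]
    rhs = [sum(x[i] * y for x, y in zip(X, scores)) for i in range(n_pairs)]
    return solve_linear(gram, rhs)


# Toy data (hypothetical): three state-action pairs with true rewards
# [1.0, 0.0, 0.5]; only the summed reward of each trajectory is observed.
trajectories = [[0, 1], [1, 2], [0, 2], [0, 0, 1]]
scores = [1.0, 0.5, 1.5, 2.0]
estimated = estimate_rewards(trajectories, scores, n_pairs=3)
```

With enough sufficiently varied trajectories the visit-count matrix has full column rank and the per-pair rewards are identifiable from the sums alone; with noisy scores the same estimator would return only an approximation.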