Inverse Reinforcement Learning in Contextual MDPs
We consider the task of Inverse Reinforcement Learning in Contextual Markov
Decision Processes (MDPs). In this setting, contexts, which define the reward
and transition kernel, are sampled from a distribution. In addition, although
the reward is a function of the context, it is not provided to the agent.
Instead, the agent observes demonstrations from an optimal policy. The goal is
to learn the reward mapping, such that the agent will act optimally even when
encountering previously unseen contexts, also known as zero-shot transfer. We
formulate this problem as a non-differentiable convex optimization problem and
propose a novel algorithm to compute its subgradients. Based on this scheme, we
analyze several methods both theoretically, comparing their sample complexity
and scalability, and empirically. Most importantly, we show both
theoretically and empirically that our algorithms perform zero-shot transfer
(generalize to new and unseen contexts). Specifically, we present empirical
experiments in a dynamic treatment regime, where the goal is to learn a reward
function that explains the behavior of expert physicians, based on recorded
data of their treatment of patients diagnosed with sepsis.
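
A minimal sketch of the subgradient scheme this abstract describes, under
illustrative assumptions: a finite MDP, a reward linear in both state features
and the context ($r_c = \Phi W c$), and a feature-matching loss whose
subgradient is the gap between agent and expert feature expectations. The
helper names and the exact objective are assumptions, not the paper's code.

```python
# Sketch of subgradient descent for contextual IRL (illustrative only).
# Assumes a linear reward mapping r_c = Phi @ W @ c over state features Phi.
import numpy as np

def solve_mdp(P, r, gamma=0.9, iters=200):
    """Value iteration on a finite MDP; returns a greedy deterministic policy."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V                  # P is (S, A, S), r is (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def occupancy(P, policy, gamma=0.9, iters=200):
    """Discounted state-occupancy measure of a deterministic policy."""
    S = P.shape[0]
    d, x = np.zeros(S), np.full(S, 1.0 / S)    # uniform start distribution
    P_pi = P[np.arange(S), policy]             # (S, S) transitions under policy
    for _ in range(iters):
        d += x
        x = gamma * x @ P_pi
    return d / d.sum()

def subgradient_step(W, c, P, Phi, d_expert, lr=0.1):
    """One step on the convex feature-matching loss for a sampled context c."""
    r = Phi @ (W @ c)                          # context-dependent state reward
    policy = solve_mdp(P, np.repeat(r[:, None], P.shape[1], axis=1))
    d_agent = occupancy(P, policy)
    # Subgradient: mismatch between agent and expert feature expectations.
    g = np.outer(Phi.T @ (d_agent - d_expert), c)
    return W - lr * g
```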
Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound
Exploration in reinforcement learning (RL) suffers from the curse of
dimensionality when the state-action space is large. A common practice is to
parameterize the high-dimensional value and policy functions using given
features. However, existing methods either have no theoretical guarantee or
suffer a regret that is exponential in the planning horizon $H$. In this paper,
we propose an online RL algorithm, namely the MatrixRL, that leverages ideas
from linear bandit to learn a low-dimensional representation of the probability
transition model while carefully balancing the exploitation-exploration
tradeoff. We show that MatrixRL achieves a regret bound where is the number of features. MatrixRL has an equivalent
kernelized version, which is able to work with an arbitrary kernel Hilbert
space without using explicit features. In this case, the kernelized MatrixRL
satisfies a regret bound , where
is the effective dimension of the kernel space. To our best
knowledge, for RL using features or kernels, our results are the first regret
bounds that are near-optimal in time and dimension (or )
and polynomial in the planning horizon
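
To make the linear-bandit machinery concrete, the sketch below maintains a
ridge-regression estimate of a transition "core" matrix $M$ under the model
$P(s'|s,a) \approx \phi(s,a)^\top M \psi(s')$, plus an elliptical confidence
bonus of the kind used in linear bandits. Class and method names are
illustrative assumptions, not the paper's implementation.

```python
# Sketch of a MatrixRL-style core-matrix estimator (illustrative only).
import numpy as np

class CoreMatrixEstimator:
    def __init__(self, d_phi, d_psi, reg=1.0):
        self.A = reg * np.eye(d_phi)       # regularized Gram matrix of phi
        self.B = np.zeros((d_phi, d_psi))  # accumulated phi(s,a) psi(s')^T

    def update(self, phi_sa, psi_next):
        """Fold in one observed transition (s, a, s')."""
        self.A += np.outer(phi_sa, phi_sa)
        self.B += np.outer(phi_sa, psi_next)

    def estimate(self):
        """Ridge-regression solution M_hat = A^{-1} B."""
        return np.linalg.solve(self.A, self.B)

    def bonus(self, phi_sa, beta=1.0):
        """Optimism bonus ~ beta * ||phi||_{A^{-1}}, as in linear bandits."""
        return beta * np.sqrt(phi_sa @ np.linalg.solve(self.A, phi_sa))
```

The bonus term is what balances the exploitation-exploration tradeoff: value
estimates are inflated where the feature directions are still poorly covered.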
Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles
Reinforcement learning (RL) methods have been shown to be capable of learning
intelligent behavior in rich domains. However, this has largely been done in
simulated domains without adequate focus on the process of building the
simulator. In this paper, we consider a setting where we have access to an
ensemble of pre-trained and possibly inaccurate simulators (models). We
approximate the real environment using a state-dependent linear combination of
the ensemble, where the coefficients are determined by the given state features
and some unknown parameters. Our proposed algorithm provably learns a
near-optimal policy with a sample complexity polynomial in the number of
unknown parameters, and incurs no dependence on the size of the state (or
action) space. As an extension, we also consider the more challenging problem
of model selection, where the state features are unknown and can be chosen from
a large candidate set. We provide exponential lower bounds that illustrate the
fundamental hardness of this problem, and develop a provably efficient
algorithm under additional natural assumptions.
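
A minimal sketch of the state-dependent linear combination described above,
assuming the coefficients are a normalized linear function of given state
features; the feature map, the clipping, and the simplex normalization are
illustrative choices, not necessarily the paper's exact construction.

```python
# Sketch of a linearly combined model ensemble (illustrative only).
import numpy as np

def combined_transition(models, theta, feats, s, a):
    """P_hat(.|s,a) = sum_k w_k(s) * P_k(.|s,a), with w(s) = feats(s) @ theta.

    models: array (K, S, A, S) -- the K pre-trained simulators
    theta:  array (d, K)       -- the unknown combination parameters
    feats:  callable s -> length-d feature vector
    """
    w = feats(s) @ theta                         # state-dependent coefficients
    w = np.clip(w, 0.0, None)
    w = w / w.sum()                              # project onto the simplex
    return np.einsum('k,ks->s', w, models[:, s, a, :])

# Tiny usage example with random placeholder models:
K, S, A, d = 3, 5, 2, 4
models = np.random.dirichlet(np.ones(S), size=(K, S, A))  # (K, S, A, S)
theta = np.random.rand(d, K)
feats = lambda s: np.eye(d)[s % d]                        # assumed feature map
p_hat = combined_transition(models, theta, feats, s=1, a=0)  # sums to 1
```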
PAC Bounds for Imitation and Model-based Batch Learning of Contextual Markov Decision Processes
We consider the problem of batch multi-task reinforcement learning with
observed context descriptors, motivated by its application to personalized
medical treatment. In particular, we study two general classes of learning
algorithms: direct policy learning (DPL), an imitation-learning based approach
which learns from expert trajectories, and model-based learning. First, we
derive sample complexity bounds for DPL, and then show that model-based
learning from expert actions can, even with a finite model class, be
impossible. After relaxing the conditions under which the model-based approach
is expected to learn by allowing for greater coverage of the state-action
space, we
provide sample complexity bounds for model-based learning with finite model
classes, showing that there exist model classes with sample complexity
exponential in their statistical complexity. We then derive a sample complexity
upper bound for model-based learning based on a measure of concentration of the
data distribution. Our results give formal justification for imitation learning
over model-based learning in this setting.
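
For concreteness, direct policy learning here can be read as
context-conditioned behavior cloning: fit a classifier from (context, state)
pairs to the expert's actions in the batch data. The sketch below uses
logistic regression as an assumed stand-in for the paper's hypothesis class;
the function names are illustrative.

```python
# Sketch of direct policy learning (DPL) as behavior cloning (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_dpl(contexts, states, expert_actions):
    """Learn pi(a | context, state) from batch expert trajectories.

    contexts: array (n, d_c) of observed context descriptors
    states:   array (n, d_s) of state features
    expert_actions: array (n,) of the expert's recorded actions
    """
    X = np.hstack([contexts, states])   # condition the policy on the context
    return LogisticRegression(max_iter=1000).fit(X, expert_actions)

def act(policy, context, state):
    """Query the learned policy for a (possibly unseen) context."""
    return policy.predict(np.hstack([context, state])[None, :])[0]
```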