Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning
Understanding human behavior from observed data is critical for transparency
and accountability in decision-making. Consider real-world settings such as
healthcare, in which modeling a decision-maker's policy is challenging -- with
no access to underlying states, no knowledge of environment dynamics, and no
allowance for live experimentation. We desire learning a data-driven
representation of decision-making behavior that (1) inheres transparency by
design, (2) accommodates partial observability, and (3) operates completely
offline. To satisfy these key criteria, we propose a novel model-based Bayesian
method for interpretable policy learning ("Interpole") that jointly estimates
an agent's (possibly biased) belief-update process together with their
(possibly suboptimal) belief-action mapping. Through experiments on both
simulated and real-world data for the problem of Alzheimer's disease diagnosis,
we illustrate the potential of our approach as an investigative device for
auditing, quantifying, and understanding human decision-making behavior.
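To make the two components named above concrete, here is a minimal Python sketch of a (possibly biased) Bayesian belief update over hidden states and a (possibly suboptimal) softmax belief-action mapping. The array layout, the softmax form, and the names T, O, Q, and temperature are illustrative assumptions, not Interpole's actual parameterization or its Bayesian inference procedure.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Bayesian belief update over hidden states.

    belief      : (S,) current belief over states
    T[a, s, s'] : subjective transition probabilities (may be biased)
    O[a, s', o] : subjective observation probabilities (may be biased)
    """
    predicted = belief @ T[action]                      # predict next-state distribution
    posterior = predicted * O[action][:, observation]   # weight by observation likelihood
    return posterior / posterior.sum()

def belief_action_policy(belief, Q, temperature=1.0):
    """Softmax (possibly suboptimal) mapping from beliefs to action probabilities.

    Q[s, a] : per-state action preferences; belief-weighted averaging scores each action.
    """
    scores = belief @ Q
    logits = scores / temperature
    logits -= logits.max()                              # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

In such a setup, fitting T, O, and Q to observed action-observation trajectories is what would recover an interpretable description of the decision-maker's behavior.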
Imitation Learning with Sinkhorn Distances
Imitation learning algorithms have been interpreted as variants of divergence
minimization problems. The ability to compare occupancy measures between
experts and learners is crucial to their effectiveness in learning from
demonstrations. In this paper, we present tractable solutions by formulating
imitation learning as minimization of the Sinkhorn distance between occupancy
measures. The formulation combines the valuable properties of optimal transport
metrics in comparing non-overlapping distributions with a cosine distance cost
defined in an adversarially learned feature space. This leads to a highly
discriminative critic network and optimal transport plan that subsequently
guide imitation learning. We evaluate the proposed approach using both the
reward metric and the Sinkhorn distance metric on a number of MuJoCo
experiments.
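As a rough illustration of the objective, the sketch below computes an entropy-regularized Sinkhorn distance and transport plan between expert and learner feature batches under a cosine-distance cost. The adversarially learned feature extractor and the policy update are assumed to exist elsewhere; the function names and hyper-parameters (eps, n_iters) are illustrative, not the paper's implementation.

```python
import numpy as np

def cosine_cost(X, Y):
    """Pairwise cosine distances between learner features X (n, d) and expert features Y (m, d)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def sinkhorn_distance(X, Y, eps=0.1, n_iters=100):
    """Entropy-regularized optimal transport between uniform empirical measures on X and Y."""
    n, m = len(X), len(Y)
    C = cosine_cost(X, Y)
    K = np.exp(-C / eps)                     # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m    # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iters):                 # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = np.diag(u) @ K @ np.diag(v)          # transport plan
    return (P * C).sum(), P
```

The resulting plan P indicates which learner samples are matched to which expert samples, and the transport cost serves as the imitation signal that the policy update would then minimize.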
Online Apprenticeship Learning
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP)
without access to the cost function. Instead, we observe trajectories sampled
by an expert that acts according to some policy. The goal is to find a policy
that matches the expert's performance on some predefined set of cost functions.
We introduce an online variant of AL (Online Apprenticeship Learning; OAL),
where the agent is expected to perform comparably to the expert while
interacting with the environment. We show that the OAL problem can be
effectively solved by combining two mirror descent based no-regret algorithms:
one for policy optimization and another for learning the worst case cost. To
this end, we derive a convergent algorithm with $O(\sqrt{K})$ regret, where $K$
is the number of interactions with the MDP, and an additional linear error term
that depends on the number of expert trajectories available. Importantly, our
algorithm avoids the need to solve an MDP at each iteration, making it more
practical compared to prior AL methods. Finally, we implement a deep variant of
our algorithm which shares some similarities to GAIL \cite{ho2016generative},
but where the discriminator is replaced with the costs learned by the OAL
problem. Our simulations demonstrate that our theoretically grounded approach
outperforms the baselines.
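A tabular, linear-cost sketch of the two coupled no-regret updates described above: an exponentiated-gradient (mirror descent) step for the policy player and one for the worst-case cost player. The variable names, step sizes, and the assumption that Q-values and feature expectations are estimated elsewhere are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def oal_step(policy, w, Q, feat_agent, feat_expert, lr_pi=0.1, lr_w=0.1):
    """One round of the two coupled mirror-descent updates.

    policy      : (S, A) action probabilities
    w           : cost weights over features, kept on the simplex
    Q           : (S, A) action values under the current cost c(s, a) = w . phi(s, a)
    feat_agent  : expected feature counts of the current policy
    feat_expert : empirical feature counts of the expert trajectories
    """
    # Policy player: exponentiated-gradient step over the simplex,
    # shifting probability mass away from high-cost actions.
    new_policy = policy * np.exp(-lr_pi * Q)
    new_policy /= new_policy.sum(axis=1, keepdims=True)

    # Cost player: mirror ascent on the feature-expectation gap,
    # searching for the worst-case cost separating agent from expert.
    new_w = w * np.exp(lr_w * (feat_agent - feat_expert))
    new_w /= new_w.sum()
    return new_policy, new_w
```

Running such rounds online, while estimating Q and the agent's feature expectations from fresh interactions, is the sense in which no MDP needs to be solved to optimality at each iteration.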
State-based Policy Representation for Deep Policy Learning
Reinforcement Learning has achieved notable success in many fields, such as video game playing, continuous control, and the game of Go. On the other hand, current approaches usually require large sample complexity and lack transferability to similar tasks. Imitation learning, also known as ``learning from demonstrations'', can mitigate the former problem by providing successful experiences. However, current methods usually assume that the expert and the imitator are the same, which lacks flexibility and robustness when the dynamics change.

Generalizability is at the core of artificial intelligence. An agent should be able to apply its knowledge to novel tasks after training in similar environments or being provided with related demonstrations. Given the current observation, it should be able to predict what can happen (modeling) and what needs to happen (planning). This raises challenges in how to represent knowledge and how to utilize it by learning from interactions or demonstrations.

In this thesis, we systematically study two important problems, the universal goal-reaching problem and the cross-morphology imitation learning problem, which are representative challenges in the fields of reinforcement learning and imitation learning. Laying out our research work on these challenging tasks unfolds our roadmap towards the holy-grail goal: making the agent generalizable by learning from observations and modeling the world.
Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning
Despite the recent success of reinforcement learning in various domains,
these approaches remain, for the most part, deterringly sensitive to
hyper-parameters and are often riddled with essential engineering feats
allowing their success. We consider the case of off-policy generative
adversarial imitation learning, and perform an in-depth review, qualitative and
quantitative, of the method. We show that forcing the learned reward function
to be local Lipschitz-continuous is a sine qua non condition for the method to
perform well. We then study the effects of this necessary condition and provide
several theoretical results involving the local Lipschitzness of the
state-value function. We complement these guarantees with empirical evidence
attesting to the strong positive effect that the consistent satisfaction of the
Lipschitzness constraint on the reward has on imitation performance. Finally,
we tackle a generic pessimistic reward preconditioning add-on spawning a large
class of reward shaping methods, which makes the base method it is plugged into
provably more robust, as shown in several additional theoretical guarantees. We
then discuss these through a fine-grained lens and share our insights.
Crucially, the guarantees derived and reported in this work are valid for any
reward satisfying the Lipschitzness condition, nothing is specific to
imitation. As such, these may be of independent interest.
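One common way to softly enforce such a local Lipschitzness constraint on a learned reward is a gradient penalty evaluated at interpolates of expert and agent samples, in the spirit of WGAN-GP. The sketch below is a generic regularizer of that kind, not the paper's exact penalty or its pessimistic preconditioning add-on; reward_net, the interpolation scheme, and the coefficient are assumptions.

```python
import torch

def gradient_penalty(reward_net, expert_sa, agent_sa, coef=10.0):
    """Soft local-Lipschitzness regularizer for a learned reward r(s, a).

    Penalizes input gradients of the reward whose norm exceeds 1 at points
    interpolated between expert and agent state-action pairs.
    """
    alpha = torch.rand(expert_sa.size(0), 1, device=expert_sa.device)
    interp = alpha * expert_sa + (1.0 - alpha) * agent_sa
    interp.requires_grad_(True)
    r = reward_net(interp).sum()
    grads, = torch.autograd.grad(r, interp, create_graph=True)
    grad_norm = grads.norm(2, dim=1)
    # One-sided penalty: only gradients larger than 1 are punished.
    return coef * ((grad_norm - 1.0).clamp(min=0.0) ** 2).mean()
```

During training, a term like this would simply be added to the discriminator or reward loss at each update, keeping the learned reward locally Lipschitz along the data manifold.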