96 research outputs found
Difference of Convex Functions Programming Applied to Control with Expert Data
This paper reports applications of Difference of Convex functions (DC)
programming to Learning from Demonstrations (LfD) and Reinforcement Learning
(RL) with expert data. This is made possible because the norm of the Optimal
Bellman Residual (OBR), which is at the heart of many RL and LfD algorithms, is
DC. Improvement in performance is demonstrated on two specific algorithms,
namely Reward-regularized Classification for Apprenticeship Learning (RCAL) and
Reinforcement Learning with Expert Demonstrations (RLED), through experiments
on generic Markov Decision Processes (MDPs), called Garnets.
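For context, here is a minimal sketch of the standard objects the abstract refers to, in ordinary MDP notation; this is not the paper's specific decomposition, only the textbook definitions of the optimal Bellman operator, the residual whose norm is claimed to be DC, and what "DC" means.

```latex
% Standard MDP notation: optimal Bellman operator T* and Optimal Bellman Residual (OBR).
\[
  (T^{*}Q)(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\,\max_{a'} Q(s',a'),
  \qquad
  J(Q) = \bigl\| T^{*}Q - Q \bigr\|
\]
% A function f is DC if it can be written f = g - h with g and h convex.
% DC programming (e.g. DCA) minimizes such objectives by repeatedly
% linearizing h and solving the resulting convex subproblem.
```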
Imitation Learning Applied to Embodied Conversational Agents
Embodied Conversational Agents (ECAs) are emerging as a key component for allowing humans to interact with machines. Applications are numerous, and ECAs can reduce the aversion to interacting with a machine by providing user-friendly interfaces. Yet, ECAs are still unable to produce social signals appropriately during their interaction with humans, which tends to make the interaction feel less natural. In particular, very little attention has been paid to the use of laughter in human-avatar interactions, despite the crucial role laughter plays in human-human interaction. In this paper, methods for predicting when and how an ECA should laugh during an interaction are proposed. Different Imitation Learning (also known as Apprenticeship Learning) algorithms are used for this purpose, and a regularized classification algorithm is shown to produce good behavior on real data.
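As a rough illustration of the classification framing (not the paper's RCAL-style algorithm), the "when to laugh" decision can be cast as regularized binary classification over interaction features. All features and labels below are hypothetical placeholders.

```python
# Minimal sketch: "when to laugh" as regularized classification over
# interaction features, in the spirit of classification-based imitation
# learning. Feature/label construction here is purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                            # e.g. prosodic/contextual features per time step
y = (X[:, 0] + rng.normal(size=500) > 1.0).astype(int)   # 1 = expert laughed at this step

clf = LogisticRegression(C=1.0)                          # L2-regularized classifier
clf.fit(X, y)
laugh_prob = clf.predict_proba(X[:5])[:, 1]              # predicted probability of laughing
print(laugh_prob)
```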
Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning
Understanding human behavior from observed data is critical for transparency
and accountability in decision-making. Consider real-world settings such as
healthcare, in which modeling a decision-maker's policy is challenging -- with
no access to underlying states, no knowledge of environment dynamics, and no
allowance for live experimentation. We desire learning a data-driven
representation of decision-making behavior that (1) inheres transparency by
design, (2) accommodates partial observability, and (3) operates completely
offline. To satisfy these key criteria, we propose a novel model-based Bayesian
method for interpretable policy learning ("Interpole") that jointly estimates
an agent's (possibly biased) belief-update process together with their
(possibly suboptimal) belief-action mapping. Through experiments on both
simulated and real-world data for the problem of Alzheimer's disease diagnosis,
we illustrate the potential of our approach as an investigative device for
auditing, quantifying, and understanding human decision-making behavior.
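For reference, the belief-update process mentioned in the abstract generalizes the textbook POMDP filtering recursion sketched below; Interpole estimates a (possibly biased) version of such an update together with a belief-action mapping, whereas this snippet only shows the standard recursion with assumed known models.

```python
# Minimal sketch of a discrete Bayes belief update (standard POMDP filtering).
# Not Interpole itself: the paper *estimates* a possibly biased update rule
# and a belief-action mapping from offline data.
import numpy as np

def belief_update(b, a, o, T, O):
    """b: belief over states (S,); a: action index; o: observation index;
    T: transition model (A, S, S'); O: observation model (A, S', O)."""
    predicted = b @ T[a]              # predict: sum_s b(s) * P(s' | s, a)
    unnorm = predicted * O[a][:, o]   # correct: weight by P(o | s', a)
    return unnorm / unnorm.sum()      # normalize to a proper belief

# toy example: 2 states, 1 action, 2 observations
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.7, 0.3], [0.4, 0.6]]])
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, O=O))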
Semi-Counterfactual Risk Minimization Via Neural Networks
Counterfactual risk minimization is a framework for offline policy
optimization with logged data which consists of context, action, propensity
score, and reward for each sample point. In this work, we build on this
framework and propose a learning method for settings where the rewards for some
samples are not observed, and so the logged data consists of a subset of
samples with unknown rewards and a subset of samples with known rewards. This
setting arises in many application domains, including advertising and
healthcare. While reward feedback is missing for some samples, it is possible
to leverage the unknown-reward samples in order to minimize the risk, and we
refer to this setting as semi-counterfactual risk minimization. To approach
this kind of learning problem, we derive new upper bounds on the true risk
under the inverse propensity score estimator. We then build upon these bounds
to propose a regularized counterfactual risk minimization method, where the
regularization term is based on the logged unknown-rewards dataset only; hence
it is reward-independent. We also propose another algorithm based on generating
pseudo-rewards for the logged unknown-rewards dataset. Experimental results
with neural networks and benchmark datasets indicate that these algorithms can
leverage the logged unknown-rewards dataset in addition to the logged
known-rewards dataset.
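As background, the inverse propensity score (IPS) estimator on the known-reward subset can be sketched as below; how the unknown-reward samples enter the regularizer, or how pseudo-rewards are generated for them, follows the paper and is not reproduced here. The clipping constant is a common variance-control choice, not something specified by the abstract.

```python
# Minimal sketch of an IPS risk estimate from logged bandit data with known rewards.
import numpy as np

def ips_risk(policy_probs, logged_props, rewards, clip=10.0):
    """policy_probs: pi(a_i | x_i) under the candidate policy;
    logged_props: pi_0(a_i | x_i) logged propensity scores;
    rewards: observed rewards (known-reward subset only)."""
    w = np.minimum(policy_probs / logged_props, clip)  # clipped importance weights
    return -np.mean(w * rewards)                       # negative expected reward = risk

# toy logged data
pi  = np.array([0.6, 0.2, 0.9])
pi0 = np.array([0.5, 0.4, 0.7])
r   = np.array([1.0, 0.0, 1.0])
print(ips_risk(pi, pi0, r))
```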
Primal Wasserstein Imitation Learning
Imitation Learning (IL) methods seek to match the behavior of an agent with
that of an expert. In the present work, we propose a new IL method based on a
conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL),
which ties to the primal form of the Wasserstein distance between the expert
and the agent state-action distributions. We present a reward function which is
derived offline, as opposed to recent adversarial IL algorithms that learn a
reward function through interactions with the environment, and which requires
little fine-tuning. We show that we can recover expert behavior on a variety of
continuous control tasks of the MuJoCo domain in a sample efficient manner in
terms of agent interactions and of expert interactions with the environment.
Finally, we show that the behavior of the agent we train matches the behavior
of the expert with the Wasserstein distance, rather than the commonly used
proxy of performance.
Comment: Published in the International Conference on Learning Representations (ICLR 2021).
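As a toy illustration only (not PWIL's actual coupling or reward scaling, which follow the paper), an offline, distance-based reward can be derived by scoring an agent state-action pair against the expert's demonstrated pairs; all names and the scale parameter below are hypothetical.

```python
# Toy illustration: an offline reward that increases as an agent state-action
# pair gets closer to the expert's state-action pairs. Not the PWIL coupling.
import numpy as np

def offline_reward(agent_sa, expert_sas, scale=5.0):
    """agent_sa: concatenated (state, action) vector; expert_sas: (N, d) expert pairs."""
    dists = np.linalg.norm(expert_sas - agent_sa, axis=1)  # distances to every expert pair
    return float(np.exp(-scale * dists.min()))             # closer to the expert -> higher reward

expert = np.random.default_rng(0).normal(size=(100, 4))    # hypothetical expert demonstrations
print(offline_reward(np.zeros(4), expert))
```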