
arXiv.org e-Print Archive

Is the Bellman residual a bad proxy?

Authors: Geist Matthieu; Pietquin Olivier; Piot Bilal
Publication date: 04/12/2017

This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. For that purpose, we place ourselves in the framework of policy search algorithms, which are usually designed to maximize the mean value, and derive a method that minimizes the residual

\|T_* v_\pi - v_\pi\|_{1,\nu}

over policies. A theoretical analysis shows how good this proxy is for policy optimization, and notably that it is better than its value-based counterpart. We also propose experiments on randomly generated generic Markov decision processes, specifically designed for studying the influence of the involved concentrability coefficient. They show that the Bellman residual is generally a bad proxy for policy optimization and that directly maximizing the mean value is much better, despite the current lack of deep theoretical analysis. This might seem obvious, as directly addressing the problem of interest is usually better, but given the prevalence of (projected) Bellman residual minimization in value-based reinforcement learning, we believe that this question is worth considering.

Comment: Final NIPS 2017 version (title, among other things, changed)
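The two criteria compared in this abstract can be made concrete on a small tabular problem. The following is a minimal sketch, not the authors' method: it only evaluates the residual \|T_* v_\pi - v_\pi\|_{1,\nu} and the mean value for a given policy, and does not perform the policy search itself. The function names (`random_mdp`, `policy_value`, `bellman_residual`, `mean_value`), the way the random MDP is generated, and the discount factor are all assumptions made for illustration.

```python
import numpy as np

def random_mdp(n_states, n_actions, gamma=0.9, seed=0):
    """Hypothetical generator: dense random transitions and rewards."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)      # each P[s, a, :] sums to 1
    R = rng.random((n_states, n_actions))  # reward for each (state, action)
    return P, R, gamma

def policy_value(P, R, gamma, pi):
    """Exact v_pi for a stochastic policy pi of shape (S, A)."""
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)            # expected one-step reward under pi
    # Solve the linear system (I - gamma * P_pi) v = r_pi.
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

def bellman_residual(P, R, gamma, pi, nu):
    """nu-weighted L1 norm of T_* v_pi - v_pi."""
    v = policy_value(P, R, gamma, pi)
    q = R + gamma * P @ v                  # one-step lookahead, shape (S, A)
    t_star_v = q.max(axis=1)               # optimal Bellman operator applied to v_pi
    return float(nu @ np.abs(t_star_v - v))

def mean_value(P, R, gamma, pi, nu):
    """Mean value of pi under the state distribution nu."""
    return float(nu @ policy_value(P, R, gamma, pi))

if __name__ == "__main__":
    P, R, gamma = random_mdp(n_states=5, n_actions=3)
    nu = np.full(5, 1 / 5)                 # uniform state distribution (assumption)
    uniform_pi = np.full((5, 3), 1 / 3)    # uniform random policy
    print("residual:", bellman_residual(P, R, gamma, uniform_pi, nu))
    print("mean value:", mean_value(P, R, gamma, uniform_pi, nu))
```

In this sketch the residual is zero exactly when v_\pi is a fixed point of T_*, that is, when the policy is optimal; the abstract's question is how well a small but non-zero residual tracks a high mean value.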

HAL-INSU

Imitation Learning Applied to Embodied Conversational Agents

Authors: Geist Matthieu; Pietquin Olivier; Piot Bilal
Publication venue: HAL CCSD
Publication date: 11/07/2015

Embodied Conversational Agents (ECAs) are emerging as a key component in allowing humans to interact with machines. Applications are numerous, and ECAs can reduce the aversion to interacting with a machine by providing user-friendly interfaces. Yet ECAs are still unable to produce social signals appropriately during their interaction with humans, which tends to make the interaction less instinctive. In particular, very little attention has been paid to the use of laughter in human-avatar interactions, despite the crucial role played by laughter in human-human interaction. In this paper, methods for predicting when and how an ECA should laugh during an interaction are proposed. Different Imitation Learning (also known as Apprenticeship Learning) algorithms are used for this purpose, and a regularized classification algorithm is shown to produce good behavior on real data.

HAL - Université de Franche-Comté

HAL - Lille 3

CiteSeerX

Boosted Bellman Residual Minimization Handling Expert Demonstrations

Authors: Geist Matthieu; Pietquin Olivier; Piot Bilal
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

This paper addresses the problem of batch Reinforcement Learning with Expert Demonstrations (RLED). In RLED, the goal is to find an optimal policy of a Markov Decision Process (MDP), using a data set of fixed sampled transitions of the MDP as well as a data set of fixed expert demonstrations. This is slightly different from the batch Reinforcement Learning (RL) framework, where only fixed sampled transitions of the MDP are available. Thus, the aim of this article is to propose algorithms that leverage those expert data. The idea proposed here differs from Approximate Dynamic Programming methods in that we minimize the Optimal Bellman Residual (OBR), where the minimization is guided by constraints defined by the expert demonstrations. This choice is motivated by the fact that controlling the OBR implies controlling the distance between the estimated and optimal quality functions. However, this method presents some difficulties, as the criterion to minimize is non-convex, non-differentiable and biased. Those difficulties are overcome via the embedding of distributions in a Reproducing Kernel Hilbert Space (RKHS) and a boosting technique, which together yield non-parametric algorithms. Finally, our algorithms are compared to the only state-of-the-art algorithm, Approximate Policy Iteration with Demonstrations (APID), in different experimental settings.
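The statement that controlling the OBR controls the distance to the optimal quality function can be illustrated in the simplest setting. The derivation below is a sketch under assumptions not taken from the abstract: it uses the supremum norm and a discount factor \gamma < 1, and relies only on the facts that T_* is a \gamma-contraction and that Q^* = T_* Q^*. The bound used in the paper itself may involve a different norm and additional constants.

```latex
\|Q^* - Q\|_\infty
  \le \|T_* Q^* - T_* Q\|_\infty + \|T_* Q - Q\|_\infty
  \le \gamma \, \|Q^* - Q\|_\infty + \|T_* Q - Q\|_\infty
\quad\Longrightarrow\quad
\|Q^* - Q\|_\infty \le \frac{1}{1-\gamma} \, \|T_* Q - Q\|_\infty
```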

HAL - Lille 3

Crossref

Predicting when to laugh with structured classification

Authors: Geist Matthieu; Pietquin Olivier; Piot Bilal
Publication venue: HAL CCSD
Publication date: 14/09/2014

Today, Embodied Conversational Agents (ECAs) are emerging as natural media for interacting with machines. Applications are numerous, and ECAs can reduce the technological gap between people by providing user-friendly interfaces. Yet ECAs are still unable to produce social signals appropriately during their interaction with humans, which tends to make the interaction less instinctive. In particular, very little attention has been paid to the use of laughter in human-avatar interactions, despite the crucial role played by laughter in human-human interaction. In this paper, a method for predicting the most appropriate moment for an ECA to laugh is proposed. Imitation learning via a structured classification algorithm is used for this purpose and is shown to produce a behavior similar to humans’ on a practical application: the yes/no game.

HAL - Université de Franche-Comté

HAL - Lille 3

Apprentissage par démonstrations : vaut-il la peine d'estimer une fonction de récompense?

Authors: Geist Matthieu; Pietquin Olivier; Piot Bilal
Publication venue: HAL CCSD
Publication date: 01/07/2013

This article presents a comparative study of Inverse Reinforcement Learning (IRL) and Imitation Learning (IL). IRL and IL are two frameworks that use the concept of a Markov Decision Process (MDP) and in which we seek to solve the Learning from Demonstrations (LfD) problem. LfD is a problem in which an agent, called the apprentice, tries to learn from observing the demonstrations of another agent, called the expert. In the IL framework, the apprentice tries to learn the expert's policy directly, whereas in the IRL framework the apprentice tries to learn the reward that explains the expert's policy. This reward is then optimized to imitate the expert. One can therefore legitimately ask whether there is any benefit in estimating a reward that must then be optimized, or whether estimating a policy is sufficient. This fairly natural question has not yet really been addressed in the literature. Here, partial answers are given from both a theoretical and a practical point of view. Keywords: Inverse Reinforcement Learning, Imitation Learning, Learning from Demonstrations

arXiv.org e-Print Archive

Learning from Demonstrations: Is It Worth Estimating a Reward Function?

Authors: Geist Matthieu; Pietquin Olivier; Piot Bilal
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

This paper provides a comparative study of Inverse Reinforcement Learning (IRL) and Apprenticeship Learning (AL). IRL and AL are two frameworks, both based on Markov Decision Processes (MDPs), that are used for the imitation learning problem, in which an agent tries to learn from demonstrations of an expert. In the AL framework, the agent tries to learn the expert policy, whereas in the IRL framework the agent tries to learn a reward that can explain the behavior of the expert. This reward is then optimized to imitate the expert. One can wonder whether it is worth estimating such a reward, or whether estimating a policy is sufficient. This quite natural question has not really been addressed in the literature so far. We provide partial answers, from both a theoretical and an empirical point of view.

End-to-end optimization of goal-driven and visually grounded dialogue systems

Authors: Courville Aaron; de Vries Harm; Mary Jeremie; Pietquin Olivier; Piot Bilal; Strub Florian
Publication date: 15/03/2017

End-to-end design of dialogue systems has recently become a popular research topic thanks to powerful tools such as encoder-decoder architectures for sequence-to-sequence learning. Yet most current approaches cast human-machine dialogue management as a supervised learning problem, aiming at predicting the next utterance of a participant given the full history of the dialogue. This view is too simplistic to capture the planning problem inherent to dialogue as well as its grounded nature, which makes the context of a dialogue larger than the history alone. This is why only chit-chat and question answering tasks have been addressed so far using end-to-end architectures. In this paper, we introduce a Deep Reinforcement Learning method, based on the policy gradient algorithm, to optimize visually grounded task-oriented dialogues. This approach is tested on a dataset of 120k dialogues collected through Mechanical Turk and provides encouraging results in solving both the problem of generating natural dialogues and the task of discovering a specific object in a complex picture.

Crossref