
    Representation learning for dialogue systems

    This thesis presents a series of steps taken towards investigating representation learning (e.g. deep learning) for building dialogue systems and conversational agents. The thesis is split into two general parts. The first part investigates representation learning for generative dialogue models. Conditioned on a sequence of turns from a text-based dialogue, these models are tasked with generating the next appropriate response in the dialogue. This part of the thesis focuses on sequence-to-sequence models, a class of generative deep neural networks. First, we propose the Hierarchical Recurrent Encoder-Decoder model, which is an extension of the vanilla sequence-to-sequence model incorporating the turn-taking structure of dialogues. Second, we propose the Multiresolution Recurrent Neural Network model, which is a stacked sequence-to-sequence model with an intermediate, stochastic representation (a "coarse representation") capturing the abstract semantic content communicated between the dialogue speakers. Third, we propose the Latent Variable Recurrent Encoder-Decoder model, which is a variant of the Hierarchical Recurrent Encoder-Decoder model with latent, stochastic, normally distributed variables. The latent variables are intended for modelling the ambiguity and uncertainty occurring naturally in human language communication. The three models are evaluated and compared on two dialogue response generation tasks: a Twitter response generation task and the Ubuntu technical response generation task. The second part of the thesis investigates representation learning for a real-world reinforcement learning dialogue system. Specifically, this part focuses on the Milabot system built by the Quebec Artificial Intelligence Institute (Mila) for the Amazon Alexa Prize 2017 competition. Milabot is a system capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language retrieval and generation models, including template-based models, bag-of-words models, and variants of the models discussed in the first part of the thesis. This part of the thesis focuses on the response selection task. Given a sequence of turns from a dialogue and a set of candidate responses, the system must select an appropriate response to give the user. A model-based reinforcement learning approach, called the Bottleneck Simulator, is proposed for selecting the appropriate candidate response. The Bottleneck Simulator learns an approximate model of the environment based on observed dialogue trajectories and human crowdsourcing, while utilizing an abstract (bottleneck) state representing high-level discourse semantics. The learned environment model is then employed to learn a reinforcement learning policy through rollout simulations. The learned policy has been evaluated and compared to competing approaches through A/B testing with real-world users, where it was found to yield excellent performance.
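
The two-level encoding idea behind a hierarchical recurrent encoder can be sketched as follows. This is only a minimal illustration, assuming untrained random weights and plain tanh recurrent cells chosen for brevity; it omits the decoder entirely and does not reproduce the thesis's actual architecture. All names (`rnn_step`, `random_matrix`, `encode_dialogue`, the parameter keys) and the toy vocabulary are hypothetical.

```python
import math
import random


def rnn_step(state, inputs, w_state, w_input):
    """One tanh recurrent step: new_state = tanh(W_state @ state + W_input @ inputs)."""
    return [
        math.tanh(
            sum(ws * s for ws, s in zip(w_state[row], state))
            + sum(wi * x for wi, x in zip(w_input[row], inputs))
        )
        for row in range(len(w_state))
    ]


def random_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encode_dialogue(turns, embeddings, params, hidden_size):
    """Encode a dialogue (list of token lists) into one context vector."""
    context = [0.0] * hidden_size
    for turn in turns:
        # Word-level encoder: restarts from zero for every turn.
        utterance = [0.0] * hidden_size
        for token in turn:
            utterance = rnn_step(
                utterance, embeddings[token],
                params["word_state"], params["word_input"],
            )
        # Turn-level encoder: consumes one utterance vector per turn.
        context = rnn_step(
            context, utterance, params["ctx_state"], params["ctx_input"]
        )
    return context


rng = random.Random(0)
EMBED, HIDDEN = 3, 4
vocab = ["hi", "how", "are", "you", "fine"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(EMBED)] for w in vocab}
params = {
    "word_state": random_matrix(HIDDEN, HIDDEN, rng),
    "word_input": random_matrix(HIDDEN, EMBED, rng),
    "ctx_state": random_matrix(HIDDEN, HIDDEN, rng),
    "ctx_input": random_matrix(HIDDEN, HIDDEN, rng),
}
dialogue = [["hi"], ["how", "are", "you"], ["fine"]]
context = encode_dialogue(dialogue, embeddings, params, HIDDEN)
print(len(context))
```

In a full model the resulting context vector would condition a decoder that generates the next response; here it only shows how turn boundaries enter the computation.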

    A Neural Network Approach to Intention Modeling for User-Adapted Conversational Agents

    Spoken dialogue systems have been proposed to enable a more natural and intuitive interaction with the environment and human-computer interfaces. In this contribution, we present a framework based on neural networks that allows modeling of the user's intention during the dialogue and uses this prediction to dynamically adapt the dialogue model of the system, taking into consideration the user's needs and preferences. We have evaluated our proposal to develop a user-adapted spoken dialogue system that facilitates tourist information and services, and we provide a detailed discussion of the positive influence of our proposal on the success of the interaction, the information and services provided, and the quality perceived by the users.

    The Dialog State Tracking Challenge Series: A Review

    In a spoken dialog system, dialog state tracking refers to the task of correctly inferring the state of the conversation, such as the user's goal, given all of the dialog history up to that turn. Dialog state tracking is crucial to the success of a dialog system, yet until recently there were no common resources, hampering progress. The Dialog State Tracking Challenge series of 3 tasks introduced the first shared testbed and evaluation metrics for dialog state tracking, and has underpinned three key advances in dialog state tracking: the move from generative to discriminative models; the adoption of discriminative sequential techniques; and the incorporation of the speech recognition results directly into the dialog state tracker. This paper reviews this research area, covering the challenge tasks themselves and summarizing the work they have enabled.
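
To make the task concrete, the sketch below tracks a single slot (for example, a price range) across turns by accumulating the confidence scores attached to each turn's understanding hypotheses. It is a hand-written noisy-OR style rule, assumed here purely for illustration; it is not one of the challenge systems, and the function names and example values are hypothetical.

```python
def update_belief(belief, hypotheses):
    """Combine the previous scores with this turn's (value, confidence) pairs.

    Scores are accumulated with a noisy-OR rule, so repeated evidence for the
    same value raises its score. The scores are not a normalized distribution.
    """
    updated = dict(belief)
    for value, confidence in hypotheses:
        prior = updated.get(value, 0.0)
        updated[value] = 1.0 - (1.0 - prior) * (1.0 - confidence)
    return updated


def track(turns):
    """Return (best_value, scores) after processing every turn in order."""
    belief = {}
    for hypotheses in turns:
        belief = update_belief(belief, hypotheses)
    best = max(belief, key=belief.get) if belief else None
    return best, belief


# Hypothetical N-best lists for two user turns.
turns = [
    [("cheap", 0.6), ("expensive", 0.3)],
    [("cheap", 0.5)],
]
best, scores = track(turns)
print(best, scores)
```

A rule like this only uses the current slot scores; the discriminative trackers the review describes instead learn how to weigh such evidence from labelled dialogs.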

    Developing attribute acquisition strategies in spoken dialogue systems via user simulation

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 159-169). A spoken dialogue system (SDS) is an application that supports conversational interaction with a human to perform some task. SDSs are emerging as an intuitive and efficient means for accessing information. A critical barrier to their widespread deployment remains in the form of communication breakdown at strategic points in the dialogue, often when the user tries to supply a named entity from a large or open vocabulary set. For example, a weather system might know several thousand cities, but there is no easy way to inform the user about what those cities are. The system will likely misrecognize any unknown city as some known city. The inability of a system to acquire an unknown value can lead to unpredictable behavior by the system, as well as by the user. This thesis presents a framework for developing attribute acquisition strategies with a simulated user. We specifically focus on the acquisition of unknown city names in a flight domain, through a spell-mode subdialogue. Collecting data from real users is costly in both time and resources. In addition, our goal is to focus on situations that tend to occur sporadically in real dialogues, depending on the domain and the user's experience in that domain. Therefore, we chose to employ user simulation, which would allow us to generate a large number of dialogues, and to configure the input as desired in order to exercise specific strategies. We present a novel method of utterance generation for user simulation that exploits an existing corpus of real user dialogues, but recombines the utterances using an example-based, template approach. Items of interest not in the corpus, such as foreign or unknown cities, can be included by splicing in synthesized speech. This method allows us to produce realistic utterances by retaining the structural variety of real user utterances, while introducing cities that can only be resolved via spelling. We also developed a model of generic dialogue management, allowing a developer to quickly specify interaction properties on a per-attribute basis. This model was used to assess the effectiveness of various combinations of dialogue strategies and simulated user behavior. Current approaches to user simulation typically model simulated utterances at the intention level, assuming perfect recognition and understanding. We employ speech to develop our strategies in the context of errors that occur naturally from recognition and understanding. We use simulation to address two problems: the conflict problem requires the system to choose how to act when a new hypothesis for an attribute conflicts with its current belief, while the compliance problem requires the system to decide whether a user was compliant with a spelling request. Decision models were learned from simulated data, and were tested with real users, showing that the learned model significantly outperformed a heuristic model in choosing the "ideal" response to the conflict problem, with accuracies of 84.1% and 52.1%, respectively. The learned model to predict compliance achieved a respectable 96.3% accuracy. These results suggest that such models learned from simulated data can attain similar, if not better, performance in dialogues with real users. By Edward A. Filisko, Ph.D.

    Evaluating intelligent personal assistants for L2 listening and speaking development

    While the use of intelligent personal assistants (IPAs) has exploded in recent years, little is known about their use to promote English as a foreign language (EFL) development. This study addresses this gap in the literature by examining the in-class use of the IPA Alexa among second language (L2) English students to support improvements in listening comprehension and speaking proficiency. The study utilized a quasi-experimental design with an experimental group (n = 13), which took part in a 10-week treatment of student-IPA interaction, and a control group (n = 15), which did not. Results from the Mann-Whitney U test indicated that the experimental group made significantly greater gains in L2 speaking proficiency. However, a significant difference was not found when comparing improvements in L2 listening comprehension. These findings suggest that IPAs may be a useful tool to promote L2 speaking skills and underscore the necessity for additional research on the emerging technology for language learning.
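
For readers unfamiliar with the test named above, the following sketch computes the Mann-Whitney U statistic for two independent groups in plain Python, using average ranks for ties. The gain scores are invented for illustration and are not data from the study; the function name `mann_whitney_u` is likewise hypothetical, and no p-value is computed.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b), the Mann-Whitney U statistics for the two samples."""
    combined = sorted(
        [(value, "a") for value in sample_a]
        + [(value, "b") for value in sample_b],
        key=lambda pair: pair[0],
    )
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values share the mean of the 1-based ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            if combined[k][1] == "a":
                rank_sum_a += average_rank
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical speaking-score gains, not taken from the study.
experimental_gains = [3, 5, 4, 6, 2]
control_gains = [1, 2, 0, 3, 1]
print(mann_whitney_u(experimental_gains, control_gains))  # (23.0, 2.0)
```

The smaller of the two values is the one usually compared against a critical value for the given group sizes; with groups as small as those in the study, an exact table or a statistics library would normally be used for the p-value.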

    Perception and Acquisition of Natural Authentic English Speech for Chinese Learners Using DIT's Speech Technologies

    Given that Chinese language learners are greatly influenced by their mother tongue, which is a tone language rather than an intonation language, learning and coping with authentic English speech seems more difficult for them than for learners from other language backgrounds. The focus of the current research is, on the basis of an analysis of the nature of spoken English and spoken Chinese, to help Chinese learners derive benefit from ICT technologies developed by the Technological University Dublin (DIT). The thesis concentrates on investigating the application of speech technologies in bridging the gap between students' internalised, idealised formulations and natural, authentic English speech. Part of the testing carried out by the present author demonstrates the acceptability of a slow-down algorithm in helping Chinese learners of English reproduce formulaic language. This algorithm is useful because it can slow down audio files to any desired speed between 100% and 40% without distortion, allowing language learners to pay attention to the real, rapid flow of 'messy' speech and follow the intonation patterns it contains. The rationale for and the application of natural, dialogic native-to-native English speech to language learning is also explored. The Chinese language learners involved in this study are exposed to authentic, native speech patterns through access to real, informal dialogue in various contexts. In the course of this analysis, the influence of speed of delivery and pitch range on the categorisation of formulaic language is also investigated. The study investigates the potential of the speech tools available to the present author as an effective EFL learning facility, especially for speakers of tone languages, and their role in helping language learners achieve confluent interaction in an English L1 environment.