34,866 research outputs found

    PRESENCE: A human-inspired architecture for speech-based human-machine interaction

    Recent years have seen steady improvements in the quality and performance of speech-based human-machine interaction, driven by a significant convergence in the methods and techniques employed. However, the quantity of training data required to improve state-of-the-art systems seems to be growing exponentially, and performance appears to be asymptotic to a level that may be inadequate for many real-world applications. This suggests that there may be a fundamental flaw in the underlying architecture of contemporary systems, as well as a failure to capitalize on the combinatorial properties of human spoken language. This paper addresses these issues and presents a novel architecture for speech-based human-machine interaction inspired by recent findings in the neurobiology of living systems. Called PRESENCE ("PREdictive SENsorimotor Control and Emulation"), this new architecture blurs the distinction between the core components of a traditional spoken language dialogue system and instead focuses on a recursive hierarchical feedback control structure. Cooperative and communicative behavior emerges as a by-product of an architecture that is founded on a model of interaction in which the system has in mind the needs and intentions of a user and a user has in mind the needs and intentions of the system.

    Investigating Linguistic Pattern Ordering in Hierarchical Natural Language Generation

    Natural language generation (NLG) is a critical component of spoken dialogue systems and can be divided into two phases: (1) sentence planning, which decides the overall sentence structure, and (2) surface realization, which determines specific word forms and flattens the sentence structure into a string. With the rise of deep learning, most modern NLG models are based on a sequence-to-sequence (seq2seq) model, which consists of an encoder-decoder structure; these NLG models generate sentences from scratch by jointly optimizing sentence planning and surface realization. However, such simple encoder-decoder architectures usually fail to generate long, complex sentences, because the decoder has difficulty learning all grammar and diction knowledge well. This paper introduces an NLG model with a hierarchical attentional decoder, where the hierarchy focuses on leveraging linguistic knowledge in a specific order. The experiments show that the proposed method significantly outperforms the traditional seq2seq model with a smaller model size, and the design of the hierarchical attentional decoder can be applied to various NLG systems. Furthermore, different generation strategies based on linguistic patterns are investigated and analyzed in order to guide future NLG research work. Comment: accepted by the 7th IEEE Workshop on Spoken Language Technology (SLT 2018). arXiv admin note: text overlap with arXiv:1808.0274

    A Knowledge-Grounded Multimodal Search-Based Conversational Agent

    Multimodal search-based dialogue is a challenging new task: it extends visually grounded question answering systems into multi-turn conversations with access to an external database. We address this new challenge by learning a neural response generation system from the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017). We introduce a knowledge-grounded multimodal conversational model where an encoded knowledge base (KB) representation is appended to the decoder input. Our model substantially outperforms strong baselines in terms of text-based similarity measures (over 9 BLEU points, 3 of which are solely due to the use of additional information from the KB).
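
The phrase "appended to the decoder input" can be read as a plain vector concatenation. The following is only a minimal sketch of that reading, not the authors' implementation: the function names, the toy vector sizes, and the choice to append the same KB encoding at every decoding step are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def append_kb_to_decoder_input(token_embedding: Vector, kb_encoding: Vector) -> Vector:
    """Concatenate one token embedding with a fixed KB encoding (hypothetical helper)."""
    return list(token_embedding) + list(kb_encoding)


def build_decoder_inputs(token_embeddings: List[Vector], kb_encoding: Vector) -> List[Vector]:
    """Append the same KB encoding to every decoder step (an assumption, not from the abstract)."""
    return [append_kb_to_decoder_input(e, kb_encoding) for e in token_embeddings]


# Toy values: two decoder steps with 2-dimensional embeddings and a 3-dimensional KB encoding.
token_embeddings = [[0.1, 0.2], [0.3, 0.4]]
kb_encoding = [1.0, 0.0, 0.5]

decoder_inputs = build_decoder_inputs(token_embeddings, kb_encoding)
```

Under these assumptions each decoder input grows from 2 to 5 dimensions, so the decoder's input layer would need to be sized for the embedding width plus the KB encoding width.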

    Improving Context Modelling in Multimodal Dialogue Generation

    In this work, we investigate the task of textual response generation in a multimodal task-oriented dialogue system. Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain. We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics. We also showcase the shortcomings of current vision and language models by performing an error analysis on our system's output.