PRESENCE: A human-inspired architecture for speech-based human-machine interaction
Recent years have seen steady improvements in the quality and performance of speech-based human-machine interaction driven by a significant convergence in the methods and techniques employed. However, the quantity of training data required to improve state-of-the-art systems seems to be growing exponentially and performance appears to be asymptotic to a level that may be inadequate for many real-world applications. This suggests that there may be a fundamental flaw in the underlying architecture of contemporary systems, as well as a failure to capitalize on the combinatorial properties of human spoken language. This paper addresses these issues and presents a novel architecture for speech-based human-machine interaction inspired by recent findings in the neurobiology of living systems. Called PRESENCE ("PREdictive SENsorimotor Control and Emulation"), this new architecture blurs the distinction between the core components of a traditional spoken language dialogue system and instead focuses on a recursive hierarchical feedback control structure. Cooperative and communicative behavior emerges as a by-product of an architecture that is founded on a model of interaction in which the system has in mind the needs and intentions of a user and a user has in mind the needs and intentions of the system.
Investigating Linguistic Pattern Ordering in Hierarchical Natural Language Generation
Natural language generation (NLG) is a critical component of spoken dialogue
systems and can be divided into two phases: (1) sentence planning: deciding
the overall sentence structure, (2) surface realization: determining specific
word forms and flattening the sentence structure into a string. With the rise
of deep learning, most modern NLG models are based on a sequence-to-sequence
(seq2seq) model, which basically contains an encoder-decoder structure; these
NLG models generate sentences from scratch by jointly optimizing sentence
planning and surface realization. However, such a simple encoder-decoder
architecture usually fails to generate complex and long sentences, because the
decoder has difficulty learning all grammar and diction knowledge well. This
paper introduces an NLG model with a hierarchical attentional decoder, where
the hierarchy focuses on leveraging linguistic knowledge in a specific order.
The experiments show that the proposed method significantly outperforms the
traditional seq2seq model with a smaller model size, and the design of the
hierarchical attentional decoder can be applied to various NLG systems.
Furthermore, different generation strategies based on linguistic patterns are
investigated and analyzed in order to guide future NLG research work.
Comment: accepted by the 7th IEEE Workshop on Spoken Language Technology (SLT
2018). arXiv admin note: text overlap with arXiv:1808.0274
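The two NLG phases named in the abstract above (sentence planning, then surface realization) can be illustrated with a toy rule-based sketch. This is not the paper's neural model: the slot names, the fixed slot order, and the phrase templates are all hypothetical, chosen only to show how the two phases divide the work.

```python
from dataclasses import dataclass

# Hypothetical slot order and phrase templates, for illustration only.
SLOT_ORDER = ["name", "food", "area"]
PHRASES = {"food": "serves {} food", "area": "is in the {} area"}


@dataclass
class SentencePlan:
    """Abstract sentence structure: a dialogue act plus ordered (slot, value) pairs."""
    act: str
    slots: list


def plan_sentence(act, slot_values):
    # Phase 1 (sentence planning): decide the overall structure,
    # here simply which slots to mention and in what order.
    ordered = [(s, slot_values[s]) for s in SLOT_ORDER if s in slot_values]
    return SentencePlan(act=act, slots=ordered)


def realize(plan):
    # Phase 2 (surface realization): pick word forms for each slot
    # and flatten the plan into a single string.
    values = dict(plan.slots)
    subject = values.get("name", "It")
    predicates = [PHRASES[s].format(v) for s, v in plan.slots if s != "name"]
    if not predicates:
        return subject + "."
    return subject + " " + " and ".join(predicates) + "."


plan = plan_sentence("inform", {"area": "north", "name": "Bistro", "food": "Thai"})
sentence = realize(plan)
```

A seq2seq model, as the abstract describes, learns both phases jointly instead of separating them into explicit functions like these.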
A Knowledge-Grounded Multimodal Search-Based Conversational Agent
Multimodal search-based dialogue is a challenging new task: It extends
visually grounded question answering systems into multi-turn conversations with
access to an external database. We address this new challenge by learning a
neural response generation system from the recently released Multimodal
Dialogue (MMD) dataset (Saha et al., 2017). We introduce a knowledge-grounded
multimodal conversational model where an encoded knowledge base (KB)
representation is appended to the decoder input. Our model substantially
outperforms strong baselines in terms of text-based similarity measures (over 9
BLEU points, 3 of which are solely due to the use of additional information
from the KB).
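As a rough illustration of appending an encoded KB representation to the decoder input, the sketch below concatenates a fixed KB vector onto each decoder input embedding. The abstract does not say whether the vector is appended at every time step or only once, so appending it at every step is an assumption here, and the function name and the toy vectors are hypothetical.

```python
def append_kb_to_decoder_inputs(token_embeddings, kb_vector):
    """Concatenate the same KB vector onto every decoder input embedding.

    Assumption: the KB representation is appended at each time step.
    """
    return [list(tok) + list(kb_vector) for tok in token_embeddings]


# Toy values: two decoder time steps with 2-dim embeddings, and a 3-dim KB encoding.
token_embeddings = [[0.1, 0.2], [0.3, 0.4]]
kb_vector = [1.0, 0.0, 0.5]

decoder_inputs = append_kb_to_decoder_inputs(token_embeddings, kb_vector)
```

Each decoder input grows from 2 to 5 dimensions, so the decoder's input layer would need to be sized for the combined width.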
Improving Context Modelling in Multimodal Dialogue Generation
In this work, we investigate the task of textual response generation in a
multimodal task-oriented dialogue system. Our work is based on the recently
released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion
domain. We introduce a multimodal extension to the Hierarchical Recurrent
Encoder-Decoder (HRED) model and show that this extension outperforms strong
baselines in terms of text-based similarity metrics. We also showcase the
shortcomings of current vision and language models by performing an error
analysis on our system's output.