Incorporating Joint Embeddings into Goal-Oriented Dialogues with Multi-Task Learning
Attention-based encoder-decoder neural network models have recently shown
promising results in goal-oriented dialogue systems. However, these models
struggle to reason over and incorporate stateful knowledge while preserving
their end-to-end text generation functionality. Since such models can greatly
benefit from user intent and knowledge graph integration, in this paper we
propose an RNN-based end-to-end encoder-decoder architecture which is trained
with joint embeddings of the knowledge graph and the corpus as input. The model
additionally integrates user intent alongside text generation, and is trained
with a multi-task learning paradigm together with a regularization technique
that penalizes generating the wrong entity as output. The
model further incorporates a Knowledge Graph entity lookup during inference to
guarantee that the generated output is stateful with respect to the local knowledge graph
provided. Finally, we evaluate the model using the BLEU score; the empirical
evaluation shows that the proposed architecture can improve the performance of
task-oriented dialogue systems.
Comment: The Semantic Web - 16th International Conference, ESWC 2019, Portorož, Slovenia, June 2-6, 2019, Proceedings, page 225-23
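As a concrete illustration of the training objective this abstract describes, the sketch below shows a minimal multi-task setup: a shared RNN encoder feeds both a response decoder and an intent classifier, and an extra penalty up-weights the token loss at entity positions. This is an assumed sketch, not the authors' code; the module names, shapes, and the weights alpha and beta are illustrative.

```python
import torch
import torch.nn as nn

class JointDialogueModel(nn.Module):
    def __init__(self, vocab_size, n_intents, hidden=256):
        super().__init__()
        # In the paper's setting these embeddings would be initialised from
        # joint knowledge-graph/corpus embeddings; here they are random.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
        self.intent_head = nn.Linear(hidden, n_intents)

    def forward(self, src, tgt):
        _, state = self.encoder(self.embed(src))        # shared encoder
        dec_out, _ = self.decoder(self.embed(tgt), state)
        return self.out(dec_out), self.intent_head(state[-1])

def multitask_loss(gen_logits, intent_logits, tgt, intent, entity_mask,
                   alpha=0.5, beta=0.1):
    # Token-level cross-entropy for response generation.
    per_token = nn.functional.cross_entropy(
        gen_logits.transpose(1, 2), tgt, reduction="none")  # (batch, seq)
    gen_loss = per_token.mean()
    # Auxiliary intent-classification loss (the second task).
    intent_loss = nn.functional.cross_entropy(intent_logits, intent)
    # Hypothetical entity penalty: extra weight on positions that should
    # emit knowledge-graph entities, discouraging wrong-entity generation.
    entity_loss = (per_token * entity_mask).mean()
    return gen_loss + alpha * intent_loss + beta * entity_loss
```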
Sequential Dialogue Context Modeling for Spoken Language Understanding
Spoken Language Understanding (SLU) is a key component of goal-oriented
dialogue systems that parses user utterances into semantic frame
representations. Traditionally, SLU does not utilize the dialogue history beyond
the previous system turn and contextual ambiguities are resolved by the
downstream components. In this paper, we explore novel approaches for modeling
dialogue context in a recurrent neural network (RNN) based language
understanding system. We propose the Sequential Dialogue Encoder Network, which
allows encoding context from the dialogue history in chronological order. We
compare the performance of our proposed architecture with two context models,
one that uses just the previous turn context and another that encodes dialogue
context in a memory network, but loses the order of utterances in the dialogue
history. Experiments with a multi-domain dialogue dataset demonstrate that the
proposed architecture results in reduced semantic frame error rates.
Comment: 8 + 2 pages. Updated 10/17: fixed typos in abstract. Updated 07/07: updated title, abstract, and a few minor changes.
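A minimal sketch of the chronological context encoding the abstract describes: one RNN encodes each past turn into a vector, and a second RNN runs over those turn vectors in order, so turn order is preserved (unlike an order-free memory network). This illustrates the general idea, not the paper's implementation; names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SequentialDialogueEncoder(nn.Module):
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.turn_rnn = nn.GRU(hidden, hidden, batch_first=True)     # within a turn
        self.context_rnn = nn.GRU(hidden, hidden, batch_first=True)  # across turns

    def forward(self, turns):
        # turns: (batch, n_turns, turn_len) token ids, oldest turn first
        b, n, t = turns.shape
        _, h = self.turn_rnn(self.embed(turns.view(b * n, t)))
        turn_vecs = h[-1].view(b, n, -1)      # one vector per turn
        _, ctx = self.context_rnn(turn_vecs)  # order-aware encoding of history
        return ctx[-1]                        # (batch, hidden) dialogue context
```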
Evaluating the Representational Hub of Language and Vision Models
The multimodal models used in the emerging field at the intersection of
computational linguistics and computer vision implement the bottom-up
processing of the 'Hub and Spoke' architecture proposed in cognitive science to
represent how the brain processes and combines multi-sensory inputs. In
particular, the Hub is implemented as a neural network encoder. We investigate
the effect on this encoder of various vision-and-language tasks proposed in the
literature: visual question answering, visual reference resolution, and
visually grounded dialogue. To measure the quality of the representations
learned by the encoder, we use two kinds of analyses. First, we evaluate the
encoder pre-trained on the different vision-and-language tasks on an existing
diagnostic task designed to assess multimodal semantic understanding. Second,
we carry out a battery of analyses aimed at studying how the encoder merges and
exploits the two modalities.
Comment: Accepted to IWCS 2019
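The first kind of analysis can be read as a standard probing setup. The sketch below is an assumption rather than the authors' code: it freezes a pre-trained multimodal 'Hub' encoder and fits a lightweight diagnostic classifier on its fused representations; the encoder's call signature and all names are hypothetical.

```python
import torch
import torch.nn as nn

def probe(encoder: nn.Module, loader, n_classes, dim, epochs=3):
    encoder.eval()                    # Hub weights stay fixed during probing
    head = nn.Linear(dim, n_classes)  # lightweight diagnostic probe
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        for image_feats, text_ids, label in loader:
            with torch.no_grad():
                rep = encoder(image_feats, text_ids)  # fused multimodal vector
            loss = nn.functional.cross_entropy(head(rep), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head  # probe accuracy then measures what the encoder captured
```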
A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System
Natural Language Understanding (NLU) and Natural Language Generation (NLG)
are the two critical components of every conversational system, handling the
tasks of understanding the user by capturing the necessary information in the
form of slots and generating an appropriate response in accordance with the
extracted information. Recently, dialogue systems integrated with complementary
information such as images, audio, or video have gained immense popularity. In
this work, we propose an end-to-end framework with the capability to extract
necessary slot values from the utterance and generate a coherent response,
thereby assisting the user to achieve their desired goals in a multimodal
dialogue system having both textual and visual information. The task of
extracting the necessary information is dependent not only on the text but also
on the visual cues present in the dialogue. Similarly, for generation, the
previous dialogue context comprising multimodal information is significant for
providing coherent and informative responses. We employ a multimodal
hierarchical encoder using pre-trained DialoGPT and also exploit the knowledge
base (KB) to provide a stronger context for both tasks. We then design a slot
attention mechanism to focus on the necessary information in a given
utterance. Finally, a decoder generates the corresponding response for the given
dialogue context and the extracted slot values. Experimental results on the
Multimodal Dialogue Dataset (MMD) show that the proposed framework outperforms
the baseline approaches on both tasks. The code is available at
https://github.com/avinashsai/slot-gpt.
Comment: Published in the journal Multimedia Tools and Applications.
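The slot attention step mentioned above can be pictured as learned slot queries attending over the encoder's token states, yielding one slot-specific summary per slot for the tagger and decoder to condition on. The sketch below is a generic formulation with assumed names and shapes, not the released slot-gpt code.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, n_slots, dim):
        super().__init__()
        # One learned query vector per slot type (illustrative choice).
        self.slot_queries = nn.Parameter(torch.randn(n_slots, dim))
        self.scale = dim ** -0.5

    def forward(self, token_states, mask=None):
        # token_states: (batch, seq_len, dim) from the hierarchical encoder
        scores = torch.einsum("sd,btd->bst",
                              self.slot_queries, token_states) * self.scale
        if mask is not None:  # mask out padding positions, mask: (batch, seq_len)
            scores = scores.masked_fill(~mask.unsqueeze(1), float("-inf"))
        attn = scores.softmax(dim=-1)           # (batch, n_slots, seq_len)
        # Slot-specific summaries of the utterance: (batch, n_slots, dim)
        return torch.einsum("bst,btd->bsd", attn, token_states)
```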