A Knowledge-Grounded Multimodal Search-Based Conversational Agent
Multimodal search-based dialogue is a challenging new task: It extends
visually grounded question answering systems into multi-turn conversations with
access to an external database. We address this new challenge by learning a
neural response generation system from the recently released Multimodal
Dialogue (MMD) dataset (Saha et al., 2017). We introduce a knowledge-grounded
multimodal conversational model where an encoded knowledge base (KB)
representation is appended to the decoder input. Our model substantially
outperforms strong baselines in terms of text-based similarity measures (over 9
BLEU points, 3 of which are solely due to the use of additional information
from the KB).
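As a rough illustration of the architecture described above, the following is a minimal sketch of a decoder whose input is augmented with an encoded knowledge base vector at every step. It assumes a PyTorch-style setup; the class and parameter names (KBGroundedDecoder, kb_dim, etc.) are illustrative and not taken from the paper.

# Minimal sketch, assuming PyTorch; not the authors' implementation.
import torch
import torch.nn as nn

class KBGroundedDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, kb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The KB is first encoded into a single fixed-size vector,
        # which is then appended to the decoder input at every time step.
        self.kb_encoder = nn.Sequential(nn.Linear(kb_dim, kb_dim), nn.Tanh())
        self.rnn = nn.GRU(emb_dim + kb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_tokens, kb_features, hidden=None):
        # prev_tokens: (batch, seq_len); kb_features: (batch, kb_dim)
        emb = self.embed(prev_tokens)                      # (B, T, emb_dim)
        kb = self.kb_encoder(kb_features)                  # (B, kb_dim)
        kb = kb.unsqueeze(1).expand(-1, emb.size(1), -1)   # (B, T, kb_dim)
        rnn_in = torch.cat([emb, kb], dim=-1)              # append KB encoding
        out, hidden = self.rnn(rnn_in, hidden)
        return self.out(out), hidden                       # vocabulary logits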
Multimodal Speech Emotion Recognition Using Audio and Text
Speech emotion recognition is a challenging task, and extensive reliance has
been placed on models that use audio features in building well-performing
classifiers. In this paper, we propose a novel deep dual recurrent encoder
model that utilizes text data and audio signals simultaneously to obtain a
better understanding of speech data. As emotional dialogue is composed of sound
and spoken content, our model encodes the information from audio and text
sequences using dual recurrent neural networks (RNNs) and then combines the
information from these sources to predict the emotion class. This architecture
analyzes speech data from the signal level to the language level, and it thus
utilizes the information within the data more comprehensively than models that
focus on audio features. Extensive experiments are conducted to investigate the
efficacy and properties of the proposed model. Our proposed model outperforms
previous state-of-the-art methods in assigning data to one of four emotion
categories (i.e., angry, happy, sad and neutral) when the model is applied to
the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Comment: 7 pages, accepted as a conference paper at IEEE SLT 2018
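A minimal sketch of the dual-encoder idea described above, assuming a PyTorch-style setup: separate recurrent encoders for audio frames and text tokens, whose final states are concatenated and fed to a four-way emotion classifier. All names and dimensions (DualRecurrentEncoder, audio_dim, etc.) are assumptions for illustration, not the paper's code.

# Minimal sketch, assuming PyTorch; dimensions are placeholders.
import torch
import torch.nn as nn

class DualRecurrentEncoder(nn.Module):
    def __init__(self, vocab_size, audio_dim=40, emb_dim=128,
                 hid_dim=128, num_classes=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, emb_dim)
        self.text_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.audio_rnn = nn.GRU(audio_dim, hid_dim, batch_first=True)
        self.classifier = nn.Linear(2 * hid_dim, num_classes)

    def forward(self, tokens, audio):
        # tokens: (B, T_text); audio: (B, T_audio, audio_dim), e.g. MFCC frames
        _, h_text = self.text_rnn(self.text_embed(tokens))
        _, h_audio = self.audio_rnn(audio)
        # Combine the final hidden states of both encoders.
        fused = torch.cat([h_text[-1], h_audio[-1]], dim=-1)
        # Logits over the four classes (angry, happy, sad, neutral).
        return self.classifier(fused)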
Learning Multimodal Word Representation via Dynamic Fusion Methods
Multimodal models have been proven to outperform text-based models on
learning semantic word representations. Almost all previous multimodal models
treat the representations from different modalities equally. However,
it is obvious that information from different modalities contributes
differently to the meaning of words. This motivates us to build a multimodal
model that can dynamically fuse the semantic representations from different
modalities according to different types of words. To that end, we propose three
novel dynamic fusion methods to assign importance weights to each modality, in
which weights are learned under the weak supervision of word association pairs.
Extensive experiments demonstrate that the proposed methods outperform strong
unimodal baselines and state-of-the-art multimodal models.
Comment: To appear in AAAI-18
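One simple way to realize the dynamic fusion idea described above is a small gating network that produces word-dependent importance weights for each modality before combining the projected modality vectors. The sketch below assumes a PyTorch-style setup with textual and visual word vectors; the names (DynamicFusion, text_dim, visual_dim) are illustrative and not the paper's three specific methods.

# Minimal sketch, assuming PyTorch; a generic gated fusion, not the
# paper's exact formulation.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, text_dim=300, visual_dim=128, out_dim=300):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, out_dim)
        self.visual_proj = nn.Linear(visual_dim, out_dim)
        # The gate sees both modalities and outputs one weight per modality.
        self.gate = nn.Sequential(
            nn.Linear(text_dim + visual_dim, 2), nn.Softmax(dim=-1))

    def forward(self, text_vec, visual_vec):
        # text_vec: (B, text_dim); visual_vec: (B, visual_dim)
        w = self.gate(torch.cat([text_vec, visual_vec], dim=-1))  # (B, 2)
        # Word-dependent weighted sum of the projected modality vectors;
        # in the paper the weights are learned under weak supervision
        # from word-association pairs.
        return (w[:, :1] * self.text_proj(text_vec)
                + w[:, 1:] * self.visual_proj(visual_vec))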
The role of avatars in e-government interfaces
This paper investigates the use of avatars to communicate live messages in e-government interfaces. A comparative study is presented that evaluates the contribution of multimodal metaphors (including avatars) to the usability of e-government interfaces and to user trust. The communication metaphors evaluated included text, earcons, recorded speech and avatars. The experimental platform comprised two interface versions tested with a sample of 30 users. The results demonstrated that the use of multimodal metaphors in an e-government interface can significantly enhance usability and increase users' trust in the interface. A set of design guidelines for the use of multimodal metaphors in e-government interfaces was also produced.
