Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis
Related tasks often depend on each other and perform better when solved in a joint framework. In this paper, we present a deep multi-task learning framework that jointly performs both sentiment and emotion analysis.
The multi-modal inputs (i.e., text, acoustic and visual frames) of a video
convey diverse and distinctive information, and usually do not have equal
contribution in the decision making. We propose a context-level inter-modal
attention framework for simultaneously predicting the sentiment and expressed
emotions of an utterance. We evaluate our proposed approach on CMU-MOSEI
dataset for multi-modal sentiment and emotion analysis. Evaluation results
suggest that the multi-task learning framework offers an improvement over the single-task framework. The proposed approach achieves new state-of-the-art performance for both sentiment and emotion analysis.
Comment: Accepted for publication in NAACL-HLT 2019
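The abstract does not include implementation details, but the core idea of attending over modality representations with shared task heads is easy to sketch. Below is a minimal, hypothetical PyTorch illustration; the module names and dimensions are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterModalAttentionMTL(nn.Module):
    """Toy multi-task model: attend across modality representations,
    then predict sentiment and emotion from the shared fused vector."""
    def __init__(self, dim=128, n_emotions=6):
        super().__init__()
        self.score = nn.Linear(dim, 1)           # scores each modality vector
        self.sentiment_head = nn.Linear(dim, 1)  # sentiment logit / regression
        self.emotion_head = nn.Linear(dim, n_emotions)

    def forward(self, text, acoustic, visual):
        # (batch, 3, dim): stack utterance-level modality representations
        modalities = torch.stack([text, acoustic, visual], dim=1)
        weights = F.softmax(self.score(modalities), dim=1)  # (batch, 3, 1)
        fused = (weights * modalities).sum(dim=1)           # (batch, dim)
        return self.sentiment_head(fused), self.emotion_head(fused)

model = InterModalAttentionMTL()
t = a = v = torch.randn(4, 128)
sent, emo = model(t, a, v)  # joint training: loss = sent_loss + emo_loss
```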
Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition
Automatic emotion recognition is a challenging task. In this paper, we
present our effort for the audio-video based sub-challenge of the Emotion
Recognition in the Wild (EmotiW) 2018 challenge, which requires participants to
assign a single emotion label to the video clip from the six universal emotions
(Anger, Disgust, Fear, Happiness, Sad and Surprise) and Neutral. The proposed
multimodal emotion recognition system takes audio, video and text information
into account. In addition to handcrafted features, we also extract bottleneck features from deep neural networks (DNNs) via transfer learning. Both temporal and non-temporal classifiers are evaluated to obtain the best unimodal emotion classification results. The class probabilities are then extracted and passed into the Beam Search Fusion (BS-Fusion). We test our method in the EmotiW 2018 challenge and achieve promising results. Compared with the baseline system, there is a significant improvement: we achieve 60.34% accuracy on the testing set, only 1.5% lower than the winner's. This shows that our method is very competitive.
Comment: 9 pages, 11 figures and 4 tables. EmotiW 2018 challenge
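The exact BS-Fusion objective is not spelled out in the abstract; one plausible reading is a beam search over which unimodal classifiers' probability outputs to average, scored on a validation set. A hedged NumPy sketch under that assumption:

```python
import numpy as np

def beam_search_fusion(probs, labels, beam_width=3):
    """Beam search over subsets of classifiers whose probabilities are
    averaged. probs: list of (n_samples, n_classes) arrays, one per
    unimodal classifier; labels: validation labels (assumed inputs)."""
    def acc(subset):
        fused = np.mean([probs[i] for i in subset], axis=0)
        return (fused.argmax(axis=1) == labels).mean()

    beam = sorted(((i,), acc((i,))) for i in range(len(probs)))
    beam = sorted(beam, key=lambda x: -x[1])[:beam_width]
    best = beam[0]
    while True:
        # expand each kept subset by one unused classifier
        candidates = [(tuple(sorted(s + (i,))), acc(tuple(sorted(s + (i,)))))
                      for s, _ in beam
                      for i in range(len(probs)) if i not in s]
        if not candidates:
            break
        beam = sorted(candidates, key=lambda x: -x[1])[:beam_width]
        if beam[0][1] <= best[1]:   # stop when no improvement
            break
        best = beam[0]
    return best                      # (classifier indices, validation accuracy)
```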
Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention
This paper focuses on two key problems for audio-visual emotion recognition in video. One is the temporal alignment of the audio and visual streams for feature-level fusion. The other is locating and re-weighting perception attention across the whole audio-visual stream for better recognition. A Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is employed as the main classification architecture. First, a soft attention mechanism aligns the audio and visual streams. Second, seven emotion embedding vectors, one corresponding to each emotion class, are added to locate the perception attention; this locating and re-weighting process is also based on the soft attention mechanism. Experimental results on the EmotiW 2015 dataset and a qualitative analysis show the effectiveness of the two proposed techniques.
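As a rough illustration of the first technique, soft attention can align the two streams by letting each visual frame attend over all audio frames, so both streams share one time axis before fusion. The sketch below is a toy under assumed shapes, not the paper's architecture:

```python
import torch
import torch.nn.functional as F

def soft_align(audio, visual):
    """audio: (Ta, d) audio frame features; visual: (Tv, d) visual frame
    features. Each visual frame receives a weighted sum of audio frames."""
    scores = visual @ audio.T            # (Tv, Ta) similarity scores
    attn = F.softmax(scores, dim=-1)     # attention weights over audio frames
    aligned_audio = attn @ audio         # (Tv, d) audio re-sampled to visual axis
    return torch.cat([visual, aligned_audio], dim=-1)  # feature-level fusion

fused = soft_align(torch.randn(40, 64), torch.randn(25, 64))  # (25, 128)
```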
Multimodal Emotion Recognition for One-Minute-Gradual Emotion Challenge
Continuous dimensional emotion, modelled by arousal and valence, can depict complex changes in emotion. In this paper, we present our work on arousal and valence prediction for the One-Minute-Gradual (OMG) Emotion Challenge. Multimodal representations are first extracted from videos using a variety of acoustic, visual, and textual models, and a support vector machine (SVM) is then used to fuse the multimodal signals and make the final predictions. Our solution achieves Concordance Correlation Coefficient (CCC) scores of 0.397 and 0.520 on arousal and valence respectively on the validation dataset, outperforming by a large margin the baseline systems, whose best CCC scores are 0.15 and 0.23 on arousal and valence.
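CCC, the challenge metric, has a simple closed form, so it can be stated directly; for reference, a plain NumPy implementation:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient:
    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mx, my = y_true.mean(), y_pred.mean()
    vx, vy = y_true.var(), y_pred.var()
    cov = ((y_true - mx) * (y_pred - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```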
Multimodal Local-Global Ranking Fusion for Emotion Recognition
Emotion recognition is a core research area at the intersection of artificial
intelligence and human communication analysis. It is a significant technical
challenge since humans display their emotions through complex idiosyncratic
combinations of the language, visual and acoustic modalities. In contrast to
traditional multimodal fusion techniques, we approach emotion recognition from
both direct person-independent and relative person-dependent perspectives. The
direct person-independent perspective follows the conventional emotion
recognition approach which directly infers absolute emotion labels from
observed multimodal features. The relative person-dependent perspective
approaches emotion recognition in a relative manner by comparing partial video
segments to determine if there was an increase or decrease in emotional
intensity. Our proposed model integrates these direct and relative prediction
perspectives by dividing the emotion recognition task into three easier
subtasks. The first subtask involves a multimodal local ranking of relative
emotion intensities between two short segments of a video. The second subtask
uses local rankings to infer global relative emotion ranks with a Bayesian
ranking algorithm. The third subtask incorporates both direct predictions from
observed multimodal behaviors and relative emotion ranks from local-global
rankings for the final emotion prediction. Our approach displays excellent performance on an audio-visual emotion recognition benchmark and improves over other algorithms for multimodal fusion.
Comment: ACM International Conference on Multimodal Interaction (ICMI 2018)
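The abstract names a Bayesian ranking algorithm for the second subtask without specifying it; as a simplified stand-in, the local pairwise judgments can be aggregated into global ranks by expected win counts (a Borda-style heuristic, not the paper's method):

```python
import numpy as np

def global_ranks_from_local(pairs, n_segments):
    """pairs: iterable of (i, j, p) with p = P(intensity_j > intensity_i)
    from the local multimodal ranker (hypothetical format). Returns a
    rank per segment, 0 = most intense."""
    wins = np.zeros(n_segments)
    for i, j, p in pairs:
        wins[j] += p          # expected wins for segment j over i
        wins[i] += 1.0 - p
    order = np.argsort(-wins)             # most intense first
    ranks = np.empty(n_segments, dtype=int)
    ranks[order] = np.arange(n_segments)
    return ranks

print(global_ranks_from_local([(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.7)], 3))
# -> [2 1 0]: segment 2 ranked most intense, segment 0 least
```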
Multimodal Relational Tensor Network for Sentiment and Emotion Classification
Understanding Affect from video segments has brought researchers from the
language, audio and video domains together. Most of the current multimodal
research in this area deals with various techniques to fuse the modalities, and
mostly treat the segments of a video independently. Motivated by the work of
(Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture,
Relational Tensor Network, where we use the inter-modal interactions within a
segment (intra-segment) and also consider the sequence of segments in a video
to model the inter-segment inter-modal interactions. We also generate rich representations of the text and audio modalities by leveraging richer audio and linguistic context, along with fusing fine-grained knowledge-based polarity scores from the text. We present the results of our model on the CMU-MOSEI dataset and show that it outperforms many baselines and state-of-the-art methods for sentiment classification and emotion recognition.
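The tensor fusion this work builds on (Zadeh et al., 2017) takes an outer product of the modality embeddings, each augmented with a constant 1 so that unimodal and bimodal interaction terms survive in the product. A minimal sketch with illustrative shapes:

```python
import torch

def tensor_fusion(text, audio, video):
    """Outer-product fusion per segment. Appending a 1 to each modality
    preserves unimodal and bimodal terms inside the 3-way product."""
    one = torch.ones(text.size(0), 1)
    t = torch.cat([text, one], dim=1)    # (batch, dt+1)
    a = torch.cat([audio, one], dim=1)   # (batch, da+1)
    v = torch.cat([video, one], dim=1)   # (batch, dv+1)
    fused = torch.einsum('bi,bj,bk->bijk', t, a, v)
    return fused.flatten(start_dim=1)    # (batch, (dt+1)*(da+1)*(dv+1))

z = tensor_fusion(torch.randn(2, 8), torch.randn(2, 4), torch.randn(2, 6))
print(z.shape)  # torch.Size([2, 315]), i.e. 9*5*7
```

The rapid growth of that flattened dimension is exactly the cost that the low-rank fusion method further down this list is designed to avoid.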
Human-Centered Emotion Recognition in Animated GIFs
As an intuitive way of expressing emotion, animated Graphical Interchange Format (GIF) images have been widely used on social media. Most previous
studies on automated GIF emotion recognition fail to effectively utilize GIF's
unique properties, and this potentially limits the recognition performance. In
this study, we demonstrate the importance of human related information in GIFs
and conduct human-centered GIF emotion recognition with a proposed Keypoint
Attended Visual Attention Network (KAVAN). The framework consists of a facial
attention module and a hierarchical segment temporal module. The facial
attention module exploits the strong relationship between GIF contents and human characters, and extracts frame-level visual features with a focus on human faces. The Hierarchical Segment LSTM (HS-LSTM) module is then proposed to better learn global GIF representations. Our proposed framework outperforms the state of the art on the MIT GIFGIF dataset. Furthermore, the facial attention module provides reliable facial region mask predictions, which improves the model's interpretability.
Comment: Accepted to IEEE International Conference on Multimedia and Expo (ICME) 2019
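One way to picture the facial attention module: use keypoint-derived facial masks to re-weight spatial CNN features before temporal modelling. The following is a hypothetical sketch, with all shapes and the mask format assumed rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def face_attended_feature(frame_feats, face_masks):
    """frame_feats: (T, C, H, W) frame feature maps from a CNN backbone.
    face_masks:  (T, H, W) soft facial-region masks built from keypoints.
    Returns one face-focused feature vector per frame."""
    logits = face_masks.flatten(1)                # (T, H*W)
    attn = F.softmax(logits, dim=1).unsqueeze(1)  # (T, 1, H*W) spatial weights
    feats = frame_feats.flatten(2)                # (T, C, H*W)
    return (feats * attn).sum(dim=2)              # (T, C) frame vectors

v = face_attended_feature(torch.randn(10, 256, 7, 7), torch.rand(10, 7, 7))
# v then feeds a temporal module such as the paper's HS-LSTM
```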
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors
Multimodal research is an emerging field of artificial intelligence, and one
of the main research problems in this field is multimodal fusion. The fusion of
multimodal data is the process of integrating multiple unimodal representations
into one compact multimodal representation. Previous research in this field has
exploited the expressiveness of tensors for multimodal representation. However, these methods often suffer from an exponential increase in dimensionality and computational complexity introduced by the transformation of the input into a tensor. In
this paper, we propose the Low-rank Multimodal Fusion method, which performs
multimodal fusion using low-rank tensors to improve efficiency. We evaluate our
model on three different tasks: multimodal sentiment analysis, speaker trait
analysis, and emotion recognition. Our model achieves competitive results on
all these tasks while drastically reducing computational complexity. Additional
experiments also show that our model can perform robustly for a wide range of
low-rank settings, and is indeed much more efficient in both training and
inference compared to other methods that utilize tensor representations.
Comment: * Equal contribution. 10 pages. Accepted by ACL 2018
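The low-rank idea can be sketched compactly: each modality gets rank-R factors, the factor projections are multiplied elementwise, and the rank dimension is summed, so the full fusion tensor is never materialized. A toy PyTorch version under assumed dimensions, not the paper's settings:

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Low-rank multimodal fusion sketch: one (d+1) x rank x out factor
    per modality (the +1 appends a bias-like constant 1 to the input)."""
    def __init__(self, dims=(32, 16, 16), out_dim=64, rank=4):
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(d + 1, rank, out_dim) * 0.1) for d in dims])

    def forward(self, *modalities):
        fused = None
        for z, W in zip(modalities, self.factors):
            z1 = torch.cat([z, torch.ones(z.size(0), 1)], dim=1)  # append 1
            proj = torch.einsum('bd,dro->bro', z1, W)   # (batch, rank, out)
            fused = proj if fused is None else fused * proj  # elementwise product
        return fused.sum(dim=1)                          # sum over rank factors

lmf = LowRankFusion()
h = lmf(torch.randn(2, 32), torch.randn(2, 16), torch.randn(2, 16))  # (2, 64)
```

The cost grows linearly in the number of modalities and in the rank, instead of multiplicatively in the modality dimensions as with the full outer product.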
Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning
Emotion recognition has become an important field of research in human-computer interaction as we improve the techniques for modelling the various aspects of behaviour. As our understanding of emotions advances with technology, there is a growing need for automatic emotion recognition systems. One direction this research is heading is the use of neural networks, which are adept at estimating complex functions that depend on a large number of diverse input sources. In this paper we attempt to exploit this effectiveness of neural networks to perform multimodal emotion recognition on the IEMOCAP dataset using speech, text, and motion-capture data covering facial expressions, rotation, and hand movements. Prior research has concentrated on emotion detection from speech on the IEMOCAP dataset, but our approach is the first to use the multiple modes of data offered by IEMOCAP for more robust and accurate emotion detection.
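The abstract does not fix an architecture; a minimal baseline of the kind described, encoding each modality and classifying the concatenation, might look like this. All dimensions are assumptions, and the IEMOCAP feature extraction is not shown:

```python
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    """Encode speech, text, and motion-capture features separately,
    concatenate, and classify into emotion categories."""
    def __init__(self, speech_dim=128, text_dim=300, mocap_dim=64, n_classes=4):
        super().__init__()
        self.enc = nn.ModuleDict({
            'speech': nn.Linear(speech_dim, 64),
            'text':   nn.Linear(text_dim, 64),
            'mocap':  nn.Linear(mocap_dim, 64),
        })
        self.clf = nn.Sequential(nn.ReLU(), nn.Linear(3 * 64, n_classes))

    def forward(self, speech, text, mocap):
        h = torch.cat([self.enc['speech'](speech),
                       self.enc['text'](text),
                       self.enc['mocap'](mocap)], dim=1)
        return self.clf(h)  # (batch, n_classes) emotion logits
```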
Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction
Continuous dimensional emotion prediction is a challenging task in which the fusion of various modalities, whether early fusion or late fusion, usually achieves state-of-the-art performance. In this paper, we propose a novel multi-modal
fusion strategy named conditional attention fusion, which can dynamically pay
attention to different modalities at each time step. A long short-term memory recurrent neural network (LSTM-RNN) is applied as the basic uni-modality model to capture long-term dependencies. The weights assigned to the different modalities are decided automatically from the current input features and recent history information rather than being fixed for any kind of situation. Our experimental results on the benchmark AVEC 2015 dataset show the effectiveness of our method, which outperforms several common fusion strategies for valence prediction.
Comment: Appeared at ACM Multimedia 2016
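A hedged sketch of the idea: per-modality predictions are combined with per-time-step weights computed from the current features and the LSTM's history state. Names and dimensions below are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalAttentionFusion(nn.Module):
    """Per-time-step attention over per-modality predictions, conditioned
    on the current input features and an LSTM summary of recent history."""
    def __init__(self, dims=(40, 60), hidden=32):
        super().__init__()
        self.preds = nn.ModuleList([nn.Linear(d, 1) for d in dims])  # unimodal heads
        self.lstm = nn.LSTM(sum(dims), hidden, batch_first=True)     # history encoder
        self.gate = nn.Linear(hidden + sum(dims), len(dims))         # attention logits

    def forward(self, feats):  # feats: list of (B, T, d_m) tensors per modality
        x = torch.cat(feats, dim=-1)                 # (B, T, sum_d)
        h, _ = self.lstm(x)                          # (B, T, hidden)
        w = F.softmax(self.gate(torch.cat([h, x], dim=-1)), dim=-1)  # (B, T, M)
        p = torch.cat([f(z) for f, z in zip(self.preds, feats)], dim=-1)
        return (w * p).sum(dim=-1)                   # (B, T) fused prediction

caf = ConditionalAttentionFusion()
y = caf([torch.randn(2, 50, 40), torch.randn(2, 50, 60)])  # (2, 50)
```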