Label-less Learning for Traffic Control in an Edge Network
With the development of intelligent applications (e.g., self-driving and
real-time emotion recognition), the requirements placed on cloud intelligence
are rising. However, cloud intelligence depends on the multi-modal data
collected by user equipment (UEs). Due to limited network bandwidth,
offloading all of the data generated by the UEs to the remote cloud is
impractical. Thus, in this article, we consider the challenging issue of
achieving a certain level of cloud intelligence while reducing network traffic.
To solve this problem, we design a traffic control algorithm based on
label-less learning on the edge cloud, dubbed LLTC. Using the limited
computing and storage resources of the edge cloud, LLTC evaluates the value of
the data to be offloaded. Specifically, we first state the problem and
describe the system architecture. Then, we design the LLTC algorithm in
detail. Finally, we set up a system testbed. Experimental results show that
the proposed LLTC can guarantee the required cloud intelligence while
minimizing the amount of data transmitted.
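As a rough illustration of the data-valuation idea, the sketch below gates offloading on the confidence of a lightweight edge model. This is an assumption-laden reading, not the paper's actual LLTC algorithm: the model, the threshold, and the confidence criterion are all placeholders.

```python
# Minimal sketch (PyTorch), NOT the paper's LLTC algorithm: an edge-side
# filter that offloads a sample only when a lightweight edge model is
# uncertain about it. Threshold and confidence criterion are assumptions.
import torch
import torch.nn.functional as F

def should_offload(edge_model: torch.nn.Module,
                   sample: torch.Tensor,
                   confidence_threshold: float = 0.9) -> bool:
    """Return True if the sample seems valuable enough to send to the cloud."""
    edge_model.eval()
    with torch.no_grad():
        logits = edge_model(sample.unsqueeze(0))            # (1, num_classes)
        confidence = F.softmax(logits, dim=-1).max().item()
    # Low edge-side confidence suggests the cloud model would add value.
    return confidence < confidence_threshold
```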
Exploring the contextual factors affecting multimodal emotion recognition in videos
Emotional expressions form a key part of user behavior on today's digital
platforms. While multimodal emotion recognition techniques are gaining research
attention, there is a lack of deeper understanding of how visual and
non-visual features can be used to better recognize emotions in certain contexts but not
others. This study analyzes the interplay between the effects of multimodal
emotion features derived from facial expressions, tone and text in conjunction
with two key contextual factors: i) gender of the speaker, and ii) duration of
the emotional episode. Using a large public dataset of 2,176 manually annotated
YouTube videos, we found that while multimodal features consistently
outperformed bimodal and unimodal features, their performance varied
significantly across different emotions, gender and duration contexts.
Multimodal features performed notably better for male speakers in recognizing
most emotions. Furthermore, multimodal features performed notably better for
shorter videos than for longer ones in recognizing neutrality and happiness,
but not sadness and anger. These findings offer new insights toward the
development of more context-aware emotion recognition and empathetic systems.
Comment: Accepted version at IEEE Transactions on Affective Computing.
Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis
Related tasks often depend on one another and perform better when solved in a
joint framework. In this paper, we present a deep multi-task learning
framework that jointly performs both sentiment and emotion analysis. The
multi-modal inputs (i.e., text, acoustic and visual frames) of a video convey
diverse and distinctive information and usually do not contribute equally to
the decision making. We propose a context-level inter-modal attention
framework for simultaneously predicting the sentiment and the expressed
emotions of an utterance. We evaluate our proposed approach on the CMU-MOSEI
dataset for multi-modal sentiment and emotion analysis. Evaluation results
suggest that the multi-task learning framework offers an improvement over its
single-task counterpart. The proposed approach reports new state-of-the-art
performance for both sentiment analysis and emotion analysis.
Comment: Accepted for publication at NAACL-HLT 2019.
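A minimal sketch of the general idea, attention-weighted fusion of per-utterance modality embeddings feeding two task heads, follows. The dimensions, head sizes, single-layer scorer, and joint-loss weighting are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterModalAttentionMTL(nn.Module):
    """Toy sketch: attend over modality embeddings, then branch into
    a sentiment head and an emotion head (multi-task learning)."""
    def __init__(self, dim=128, n_sentiments=2, n_emotions=6):
        super().__init__()
        self.attn = nn.Linear(dim, 1)               # scores each modality
        self.sentiment_head = nn.Linear(dim, n_sentiments)
        self.emotion_head = nn.Linear(dim, n_emotions)

    def forward(self, text, audio, visual):
        # Stack the per-utterance modality embeddings: (batch, 3, dim)
        modalities = torch.stack([text, audio, visual], dim=1)
        weights = F.softmax(self.attn(modalities), dim=1)    # (batch, 3, 1)
        fused = (weights * modalities).sum(dim=1)            # (batch, dim)
        return self.sentiment_head(fused), self.emotion_head(fused)

# Joint training would sum the two task losses, e.g.
# loss = ce(sent_logits, sent_y) + ce(emo_logits, emo_y)  # equal weighting assumed
```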
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
Obtaining large, human labelled speech datasets to train models for emotion
recognition is a notoriously challenging task, hindered by annotation cost and
label ambiguity. In this work, we consider the task of learning embeddings for
speech classification without access to any form of labelled audio. We base our
approach on a simple hypothesis: that the emotional content of speech
correlates with the facial expression of the speaker. By exploiting this
relationship, we show that annotations of expression can be transferred from
the visual domain (faces) to the speech domain (voices) through cross-modal
distillation. We make the following contributions: (i) we develop a strong
teacher network for facial emotion recognition that achieves the state of the
art on a standard benchmark; (ii) we use the teacher to train a student, tabula
rasa, to learn representations (embeddings) for speech emotion recognition
without access to labelled audio data; and (iii) we show that the speech
emotion embedding can be used for speech emotion recognition on external
benchmark datasets. Code, models and data are available.
Comment: Conference paper at ACM Multimedia 2018.
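The cross-modal transfer described here is an instance of knowledge distillation. A minimal sketch of one standard distillation loss, with the temperature value as an assumption, might look like this:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Match the speech student's softened predictions to the face
    teacher's, in the standard knowledge-distillation formulation."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence scaled by T^2 keeps gradient magnitudes comparable.
    return F.kl_div(log_student, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Training sketch: the teacher sees face frames, the student sees the audio
# of the same clip, so no audio labels are needed:
# loss = distillation_loss(student(audio), teacher(faces).detach())
```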
WiFE: WiFi and Vision based Intelligent Facial-Gesture Emotion Recognition
Emotion is an essential part of Artificial Intelligence (AI) and human mental
health. Current emotion recognition research mainly focuses on a single
modality (e.g., facial expression), while human emotional expression is multi-modal in
nature. In this paper, we propose a hybrid emotion recognition system
leveraging two emotion-rich and tightly-coupled modalities, i.e., facial
expression and body gesture. However, unbiased and fine-grained facial
expression and gesture recognition remain a major problem. To this end, unlike
approaches that rely on contact or even invasive sensors, we explore commodity
WiFi signals for device-free and contactless gesture recognition, while
adopting a vision-based approach for facial expression. Two design challenges
remain: how to improve the sensitivity of WiFi signals, and how to process the
large-volume, heterogeneous, and non-synchronous data contributed by the two
modalities. For the former, we propose a signal
sensitivity enhancement method based on the Rician K factor theory; for the
latter, we combine CNN and RNN to mine the high-level features of bi-modal
data, and perform a score-level fusion for fine-grained recognition. To
evaluate the proposed method, we build a first-of-its-kind Vision-CSI Emotion
Database (VCED) and conduct extensive experiments. Empirical results show the
superiority of the bi-modal approach, which achieves 83.24% recognition
accuracy over seven emotions, compared with 66.48% and 66.67% for the
gesture-only and facial-only solutions, respectively. The VCED database is
available at https://github.com/purpleleaves007/WIFE-Dataset.
Comment: error in experiment result
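A minimal sketch of the score-level fusion step follows; the equal weighting is an assumption, and the CNN (vision) and RNN (WiFi CSI) feature extractors that produce the logits are omitted.

```python
import torch
import torch.nn.functional as F

def score_level_fusion(face_logits, gesture_logits, alpha=0.5):
    """Score-level fusion sketch: average the class posteriors of the
    vision branch (CNN on faces) and the WiFi branch (RNN on CSI).
    The weighting alpha is an assumption, not a value from the paper."""
    face_scores = F.softmax(face_logits, dim=-1)
    gesture_scores = F.softmax(gesture_logits, dim=-1)
    fused = alpha * face_scores + (1.0 - alpha) * gesture_scores
    return fused.argmax(dim=-1)   # predicted emotion class per sample
```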
Robust Deep Multi-modal Learning Based on Gated Information Fusion Network
The goal of multi-modal learning is to use complementary information about the
relevant task, provided by multiple modalities, to achieve reliable and robust
performance. Recently, deep learning has led to significant improvements in
multi-modal learning by allowing information fusion at intermediate feature
levels. This paper addresses the problem of designing a robust deep
multi-modal learning architecture in the presence of imperfect modalities. We
introduce a deep fusion architecture for object detection which processes each
modality using a separate convolutional neural network (CNN) and constructs
the joint feature map by combining the intermediate features from the CNNs. In
order to improve robustness to degraded modalities, we employ a gated
information fusion (GIF) network, which weights the contribution from each
modality according to the input feature maps to be fused. The weights are
determined by convolutional layers followed by a sigmoid function, and are
trained along with the information fusion network in an end-to-end fashion.
Our experiments show that the proposed GIF network offers additional
architectural flexibility for achieving robust performance with degraded
modalities, and demonstrate a significant performance improvement with a
Single Shot Detector (SSD) on the KITTI dataset using the proposed fusion
network and data augmentation schemes.
Comment: 2018 Asian Conference on Computer Vision (ACCV).
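A minimal sketch of the gating idea follows: gates are computed by a convolution over the concatenated feature maps and squashed by a sigmoid. The kernel size and the per-pixel (rather than scalar) gating are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class GatedInformationFusion(nn.Module):
    """Sketch: per-modality gates produced by convolutions over the
    concatenated intermediate feature maps, followed by a sigmoid."""
    def __init__(self, channels: int):
        super().__init__()
        # One gate per modality, each conditioned on both feature maps.
        self.gate_a = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)
        self.gate_b = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        joint = torch.cat([feat_a, feat_b], dim=1)      # (B, 2C, H, W)
        w_a = torch.sigmoid(self.gate_a(joint))         # (B, 1, H, W)
        w_b = torch.sigmoid(self.gate_b(joint))
        # After training, a degraded modality can receive a small weight
        # and be suppressed in the joint feature map.
        return w_a * feat_a + w_b * feat_b              # (B, C, H, W)
```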
Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction
Continuous dimensional emotion prediction is a challenging task in which the
fusion of multiple modalities, whether early or late, usually achieves
state-of-the-art performance. In this paper, we propose a novel multi-modal
fusion strategy named conditional attention fusion, which can dynamically pay
attention to different modalities at each time step. Long short-term memory
recurrent neural networks (LSTM-RNNs) are applied as the basic uni-modal
models to capture long-range temporal dependencies. The weights assigned to
different modalities are automatically decided by the current input features
and recent history information, rather than being fixed for all situations.
Our experimental results on the AVEC 2015 benchmark dataset show the
effectiveness of our method, which outperforms several common fusion
strategies for valence prediction.
Comment: Appeared at ACM Multimedia 2016.
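A minimal sketch of per-time-step attention over the unimodal LSTM hidden states follows; the dimensions and the single linear scorer are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalAttentionFusion(nn.Module):
    """Sketch: at each time step, modality weights are conditioned on the
    current hidden states of the unimodal LSTM branches."""
    def __init__(self, hidden=64, n_modalities=2):
        super().__init__()
        self.scorer = nn.Linear(n_modalities * hidden, n_modalities)
        self.regressor = nn.Linear(hidden, 1)   # valence at this time step

    def forward(self, hidden_states):            # list of (B, hidden) tensors
        stacked = torch.stack(hidden_states, dim=1)           # (B, M, hidden)
        ctx = stacked.flatten(start_dim=1)                    # (B, M*hidden)
        weights = F.softmax(self.scorer(ctx), dim=-1)         # (B, M)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (B, hidden)
        return self.regressor(fused).squeeze(-1)
```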
Tensor Fusion Network for Multimodal Sentiment Analysis
Multimodal sentiment analysis is an increasingly popular research area, which
extends the conventional language-based definition of sentiment analysis to a
multimodal setup where other relevant modalities accompany language. In this
paper, we pose the problem of multimodal sentiment analysis as modeling
intra-modality and inter-modality dynamics. We introduce a novel model, termed
Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed
approach is tailored for the volatile nature of spoken language in online
videos as well as accompanying gestures and voice. In the experiments, our
model outperforms state-of-the-art approaches for both multimodal and unimodal
sentiment analysis.
Comment: Accepted as a full paper at EMNLP 2017.
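The core of Tensor Fusion is the outer product of the modality embeddings, each extended with a constant 1 so that unimodal and bimodal interaction terms survive alongside the trimodal ones. A minimal sketch of that fusion step:

```python
import torch

def tensor_fusion(z_text, z_audio, z_video):
    """Sketch of the fusion step: append a constant 1 to each embedding,
    then take the 3-way outer product so unimodal, bimodal and trimodal
    interaction terms all appear in the fused tensor."""
    one = torch.ones_like(z_text[:, :1])        # (B, 1)
    t = torch.cat([z_text, one], dim=1)         # (B, d_t + 1)
    a = torch.cat([z_audio, one], dim=1)        # (B, d_a + 1)
    v = torch.cat([z_video, one], dim=1)        # (B, d_v + 1)
    fused = torch.einsum("bi,bj,bk->bijk", t, a, v)
    return fused.flatten(start_dim=1)           # fed to the inference network
```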
Annotation and Detection of Emotion in Text-based Dialogue Systems with CNN
Knowledge of users' emotion states helps improve human-computer interaction.
In this work, we present EmoNet, an emotion detector for Chinese daily
dialogues based on deep convolutional neural networks. To preserve original
linguistic features such as word order, commonly used methods like
segmentation and keyword extraction were not adopted; instead, we increased
the depth of the CNN and let it learn the inner linguistic relationships
itself. Our main contribution is a new model and a new pipeline that can be
used in multi-language environments to solve sentiment problems. Experimental
results show that EmoNet has a strong capacity for learning the emotion of
dialogues and achieves better results than other state-of-the-art detectors.
Comment: 7 pages, 7 figures
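A minimal sketch of a deep character-level CNN in the spirit described above, operating on raw character ids with no segmentation; the vocabulary size, depth, and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class CharLevelEmotionCNN(nn.Module):
    """Sketch: characters are embedded directly, with no word segmentation
    or keyword extraction, so the stacked convolutions must learn the
    linguistic structure themselves."""
    def __init__(self, vocab_size=6000, embed_dim=64, n_emotions=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        layers, channels = [], embed_dim
        for _ in range(4):                        # "increased depth" of the CNN
            layers += [nn.Conv1d(channels, 128, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool1d(2)]
            channels = 128
        self.conv = nn.Sequential(*layers)
        self.classifier = nn.Linear(128, n_emotions)

    def forward(self, char_ids):                  # (B, seq_len) integer ids
        x = self.embed(char_ids).transpose(1, 2)  # (B, embed_dim, seq_len)
        x = self.conv(x).mean(dim=-1)             # global average pooling
        return self.classifier(x)
```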
Multimodal Emotion Recognition for One-Minute-Gradual Emotion Challenge
Continuous dimensional emotion, modelled by arousal and valence, can depict
complex changes in emotion. In this paper, we present our work on arousal and
valence prediction for the One-Minute-Gradual (OMG) Emotion Challenge.
Multimodal representations are first extracted from videos using a variety of
acoustic, video and textual models, and a support vector machine (SVM) is then
used to fuse the multimodal signals into final predictions. Our solution
achieves Concordance Correlation Coefficient (CCC) scores of 0.397 and 0.520
on arousal and valence respectively for the validation dataset, outperforming
by a large margin the baseline systems, whose best CCC scores are 0.15 on
arousal and 0.23 on valence.
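For reference, the challenge metric can be computed directly from its definition; a minimal NumPy implementation of the Concordance Correlation Coefficient:

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```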